Python-based statistical scripting language with Jupyter notebook support
StatLang
An open-source, Python-based statistical scripting language
Write and run statistical scripts with full syntax highlighting and a Python backend.
Overview
StatLang provides an open-source environment for statistical analysis by offering:
- Expressive scripting syntax for data manipulation and analysis
- Python backend for execution and performance
- Jupyter notebook support with a StatLang kernel
- VS Code extension with syntax highlighting and execution
- Cross-platform compatibility (Windows, macOS, Linux)
- Open source and free to use
What Makes StatLang Special?
- AI Integration: Built-in PROC LANGUAGE with LLM capabilities for intelligent data analysis
- Complete ML Pipeline: From data exploration to model deployment using familiar, concise syntax
- Modern SQL: PROC SQL powered by DuckDB for high-performance data querying
- Robust language features: Macro system, format system, and statistical procedures
- Rich Visualizations: Professional output formatting with TITLE statements and structured results
Features
Core Interpreter
- Scripting-based DATA step functionality with inline data support
- Statistical procedures (MEANS, FREQ, SORT, PRINT)
- Concise data manipulation and analysis syntax
- Python pandas/numpy backend for performance
- Clean, professional output with familiar formatting
Jupyter Notebook Support
- StatLang kernel for Jupyter notebooks
- Interactive statistical programming in a notebook environment
- Rich output display with formatted tables
- Dataset visualization and exploration
VS Code Extension
- Syntax highlighting for .statlang files
- Code snippets for common statistical analysis patterns
- File execution directly from VS Code
- Notebook support for interactive analysis
Supported Features
Statistical Procedures
- PROC MEANS: Descriptive statistics with CLASS variables and OUTPUT statements
- PROC FREQ: Frequency tables and cross-tabulations with options
- PROC SORT: Data sorting with ascending/descending order
- PROC PRINT: Data display and formatting
- PROC REG: Linear regression analysis with MODEL, OUTPUT, and SCORE statements
- PROC UNIVARIATE: Detailed univariate analysis with distribution diagnostics
- PROC CORR: Correlation analysis (Pearson, Spearman)
- PROC FACTOR: Principal component analysis and factor analysis
- PROC CLUSTER: Clustering methods (k-means, hierarchical)
- PROC NPAR1WAY: Nonparametric tests (Mann-Whitney, Kruskal-Wallis)
- PROC TTEST: T-tests (independent and paired)
- PROC LOGIT: Logistic regression modeling
- PROC TIMESERIES: Time series analysis and seasonal decomposition
- PROC SURVEYSELECT: Random sampling with SRS method, SAMPRATE/N options, and OUTALL flag
Machine Learning Procedures
- PROC TREE: Decision trees for classification and regression
- PROC FOREST: Random forests for ensemble learning
- PROC BOOST: Gradient boosting for advanced modeling
Advanced Features
- PROC SQL: SQL query processing with DuckDB backend
- PROC LANGUAGE: Built-in LLM integration for text generation, Q&A, and data analysis
- Macro System: Complete macro facility with %MACRO/%MEND, %LET, & substitution, %PUT, %IF/%THEN/%ELSE, %DO/%END
- Format System: Built-in date/time, numeric, and currency formats with metadata persistence
- TITLE Statements: Professional output formatting
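As a rough illustration of what %LET and & substitution mean in the Macro System bullet, the following Python sketch resolves macro variable references in a script before it runs. It is a toy model, not StatLang's implementation: the function name `resolve_macro_vars`, the regular expressions, and the choice to leave unknown references untouched are all assumptions made for this example. It also collects every %LET before substituting, which ignores statement order.

```python
import re

# Hypothetical helper, not part of the statlang package.
LET_PATTERN = re.compile(r"%let\s+(\w+)\s*=\s*(.*?);", re.IGNORECASE)
REF_PATTERN = re.compile(r"&(\w+)\.?")


def resolve_macro_vars(script: str) -> str:
    """Collect %LET assignments, drop them, and substitute &name references."""
    symbols = {}

    def record(match):
        symbols[match.group(1).lower()] = match.group(2).strip()
        return ""  # the %LET statement itself contributes no output text

    body = LET_PATTERN.sub(record, script)

    def substitute(match):
        name = match.group(1).lower()
        # Unknown references are left as written (an assumption of this sketch).
        return symbols.get(name, match.group(0))

    return REF_PATTERN.sub(substitute, body).strip()


script = """
%let dsn = work.employees;
%let var = salary;
proc means data=&dsn; var &var; run;
"""
print(resolve_macro_vars(script))
```

A real macro facility also has to handle %IF/%THEN/%ELSE, %DO/%END, and nested references, none of which this sketch attempts.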
Core Data Processing
- DATA Steps: Variable creation, conditional logic, DATALINES input
- Macro variables: %LET, %PUT statements
- Libraries: LIBNAME functionality
- NOPRINT option: Silent execution for procedures
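To make the DATA step bullet concrete, here is a small Python sketch of list-style input: an INPUT statement names the variables, a trailing `$` marks a character variable, and each DATALINES row is split on whitespace. The helpers `parse_input_spec` and `parse_datalines` are hypothetical and only model the simple case used in this README; StatLang's own parser may behave differently.

```python
def parse_input_spec(spec: str):
    """Turn 'employee_id name $ salary' into [(name, is_character), ...]."""
    columns = []
    for token in spec.split():
        if token == "$":
            # '$' flags the preceding variable as character-valued.
            name, _ = columns[-1]
            columns[-1] = (name, True)
        else:
            columns.append((token, False))
    return columns


def parse_datalines(spec: str, lines: str):
    """Hypothetical helper: build one dict per non-empty data line."""
    columns = parse_input_spec(spec)
    rows = []
    for line in lines.strip().splitlines():
        values = line.split()
        row = {}
        for (name, is_char), raw in zip(columns, values):
            # Numeric variables are converted to float; character ones stay strings.
            row[name] = raw if is_char else float(raw)
        rows.append(row)
    return rows


rows = parse_datalines(
    "employee_id name $ department $ salary",
    """
    1 Alice Engineering 75000
    2 Bob Marketing 55000
    """,
)
print(rows[0])
```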
Installation
Python Package
pip install statlang
Jupyter Kernel Installation
# Install the StatLang kernel
python -m statlang.kernel install
# List available kernels
jupyter kernelspec list
VS Code Extension
- Install from VS Code Marketplace: "StatLang" by RyanBlakeStory
- Or install from source (see Development section)
Exciting New Features
LANGUAGE - AI-Powered Analysis
language prompt="Analyze the correlation between income and spending in our dataset";
run;
Built-in LLM integration for text generation, Q&A, and intelligent data analysis using Hugging Face transformers!
Complete Machine Learning Workflow
Check out our ML Project Demo - a comprehensive regression analysis project showcasing:
- PROC UNIVARIATE for distribution exploration
- PROC SURVEYSELECT for train/test splitting
- PROC REG with MODEL, OUTPUT, and SCORE statements
- Macro system for reusable analysis workflows
- Complete ML pipeline in pure StatLang syntax
SQL - Modern Data Querying
sql;
select age, income, spend,
case when income > 60000 then 'High' else 'Low' end as income_group
from work.customers
where age between 25 and 50
order by income desc;
quit;
DuckDB-powered SQL processing with full dataset integration!
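The query above can be checked independently of StatLang. The sketch below runs the same SELECT with Python's built-in sqlite3 module, which is a stand-in for DuckDB here and not the engine StatLang uses. The `customers` rows are made up for illustration, and attaching an in-memory database under the name `work` is simply a way to make the `work.customers` reference resolve in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database named "work" so work.customers resolves.
conn.execute("ATTACH DATABASE ':memory:' AS work")
conn.execute("CREATE TABLE work.customers (age INTEGER, income INTEGER, spend INTEGER)")

# Made-up sample rows, not data from the StatLang project.
conn.executemany(
    "INSERT INTO work.customers VALUES (?, ?, ?)",
    [(23, 40000, 1200), (31, 72000, 3100), (45, 58000, 2000), (52, 90000, 4000)],
)

query = """
select age, income, spend,
       case when income > 60000 then 'High' else 'Low' end as income_group
from work.customers
where age between 25 and 50
order by income desc
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
conn.close()
```

Only the two customers aged 25 to 50 are returned, ordered from highest to lowest income.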
Quick Start
1. Interactive Python Usage
from statlang import StatLangInterpreter
# Create interpreter
interpreter = StatLangInterpreter()
# Create sample data using StatLang syntax
interpreter.run_code('''
data work.employees;
input employee_id name $ department $ salary;
datalines;
1 Alice Engineering 75000
2 Bob Marketing 55000
3 Carol Engineering 80000
4 David Sales 45000
;
run;
''')
# Run statistical analysis
interpreter.run_code('''
proc means data=work.employees;
class department;
var salary;
run;
''')
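For readers who want to sanity-check the PROC MEANS output, the sketch below computes per-department statistics for the same four employees using only the Python standard library. It is an independent cross-check, not a description of how StatLang computes its results, and it makes no claim about which statistics StatLang prints by default.

```python
from collections import defaultdict
from statistics import mean

employees = [
    (1, "Alice", "Engineering", 75000),
    (2, "Bob", "Marketing", 55000),
    (3, "Carol", "Engineering", 80000),
    (4, "David", "Sales", 45000),
]

# Group salaries by department, mirroring "class department; var salary;".
by_department = defaultdict(list)
for _id, _name, department, salary in employees:
    by_department[department].append(salary)

summary = {
    dept: {"n": len(vals), "mean": mean(vals), "min": min(vals), "max": max(vals)}
    for dept, vals in sorted(by_department.items())
}
for dept, stats in summary.items():
    print(dept, stats)
```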
2. Jupyter Notebook Usage
- Install the StatLang kernel:
python -m statlang.kernel install
- Create a new Jupyter notebook (.ipynb)
- Select "statlang" as the kernel
- Write StatLang code in cells and execute
3. VS Code Usage
- Install the StatLang extension from the marketplace
- Create a new file with the .statlang extension
- Write your StatLang code
- Use Ctrl+Shift+P → "StatLang: Run File" to execute
4. Command Line Usage
# Run StatLang code from file
python -m statlang.cli run example.statlang
# Interactive mode
python -m statlang.cli interactive
Examples & Demos
Complete ML Project
ML Project Demo - A comprehensive machine learning workflow:
- Synthetic dataset creation with 30 observations
- PROC UNIVARIATE for distribution analysis
- PROC SURVEYSELECT for train/test splitting (70/30)
- PROC REG with MODEL, OUTPUT, and SCORE statements
- Macro-based reusable analysis functions
- Complete regression analysis pipeline
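The 70/30 train/test split listed above amounts to simple random sampling without replacement. The Python sketch below shows the idea for 30 observations, keeping every row and adding a flag that marks the sampled ones. The function `srs_split`, the `selected` key, and the rounding rule are assumptions for illustration; they are not taken from StatLang's PROC SURVEYSELECT implementation.

```python
import random


def srs_split(rows, samprate, seed=None):
    """Hypothetical helper: flag a simple random sample of the rows.

    Returns a new list of dicts, each with a boolean 'selected' key,
    so that all rows are kept and the sampled ones are marked.
    """
    rng = random.Random(seed)
    n_sample = round(samprate * len(rows))  # rounding rule is an assumption
    chosen = set(rng.sample(range(len(rows)), n_sample))
    return [dict(row, selected=(i in chosen)) for i, row in enumerate(rows)]


observations = [{"id": i} for i in range(1, 31)]
flagged = srs_split(observations, samprate=0.7, seed=42)

train = [r for r in flagged if r["selected"]]
test = [r for r in flagged if not r["selected"]]
print(len(train), len(test))
```

With 30 rows and a rate of 0.7, the flagged rows form a 21-row training set and the remaining 9 rows form the test set.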
Comprehensive Walkthrough
StatLang Walkthrough - Complete feature demonstration:
- All statistical procedures with examples
- Macro system demonstrations
- Format system usage
- Advanced data manipulation techniques
- Real-world analysis scenarios
Project Structure
StatLang/
├── stat_lang/                  # Core Python package
│   ├── __init__.py
│   ├── interpreter.py          # Main statistical interpreter
│   ├── cli.py                  # Command line interface
│   ├── kernel/                 # Jupyter kernel implementation
│   │   ├── statlang_kernel.py  # Main kernel
│   │   └── install.py          # Kernel installation
│   ├── parser/                 # Syntax parser
│   │   ├── data_step_parser.py
│   │   ├── proc_parser.py
│   │   └── macro_parser.py
│   ├── procs/                  # Statistical procedure implementations
│   │   ├── proc_means.py
│   │   ├── proc_freq.py
│   │   ├── proc_sort.py
│   │   └── proc_print.py
│   └── utils/                  # Utility functions
│       ├── expression_evaluator.py
│       ├── data_utils.py
│       └── libname_manager.py
├── vscode-extension/           # VS Code extension
├── examples/                   # Example files and demo notebook
├── media/                      # Logo and icons
├── setup.py                    # Package setup
└── README.md
Development
Setup Development Environment
git clone https://github.com/ryan-story/StatLang.git
cd StatLang
pip install -e .
Running Tests
# Run basic functionality tests
python -c "from statlang import StatLangInterpreter; print('StatLang loaded successfully')"
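The one-line import check can be extended into a small unittest-based smoke test. The sketch below uses only `StatLangInterpreter` and `run_code`, as shown in the Quick Start, and skips itself when the package is not installed. It assumes that `run_code` raises an exception when a script fails, which this README does not confirm, so treat it as a starting point.

```python
import importlib.util
import unittest

# True only when the statlang package can be imported in this environment.
STATLANG_AVAILABLE = importlib.util.find_spec("statlang") is not None

DATA_STEP = """
data work.t;
input x;
datalines;
1
2
3
;
run;
"""


@unittest.skipUnless(STATLANG_AVAILABLE, "statlang is not installed")
class SmokeTest(unittest.TestCase):
    def test_data_step_and_means(self):
        from statlang import StatLangInterpreter

        interpreter = StatLangInterpreter()
        # Assumption: run_code raises if the script cannot be executed.
        interpreter.run_code(DATA_STEP)
        interpreter.run_code("proc means data=work.t; var x; run;")


def run_smoke_tests():
    """Run the smoke test and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SmokeTest)
    return unittest.TextTestRunner(verbosity=2).run(suite)


result = run_smoke_tests()
```

A skipped run still counts as successful, so the script is safe to keep in environments where StatLang is absent.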
Key Features Implemented
Completed Features
- Core DATA step implementation with DATALINES
- Statistical procedures with CLASS variables and OUTPUT statements
- Frequency analysis with cross-tabulations and options
- Data sorting with ascending/descending order
- Data display and formatting
- Linear regression analysis with PROC REG
- Random sampling with PROC SURVEYSELECT
- Silent execution options
- Jupyter notebook kernel
- VS Code extension with syntax highlighting
- Clean, professional output
- Concise behavior and syntax
Future Enhancements
- Additional statistical procedures (SQL queries, advanced regression, etc.)
- Advanced macro functionality
- Performance optimizations
- Enhanced data connectivity options
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Areas for Contribution
- Additional statistical procedures
- Macro functionality enhancements
- Performance optimizations
- VS Code extension features
- Documentation and examples
License
MIT License - see LICENSE for details.
Support
- Documentation
- Issue Tracker
- Discussions