CEDARScript grammar.js for tree-sitter

CEDARScript

A SQL-like language for efficient code analysis, transformations, and tool use. Most useful for AI code assistants.

License: AGPL v3. Code style: black.

What is CEDARScript?

It's a domain-specific language designed to improve how AI coding assistants interact with codebases and communicate their code modification intentions.

It provides a standardized way to express complex code modification and analysis operations, making it easier for AI-assisted development tools to understand and execute these tasks.

It also works as a gateway to external tools, so that the LLM can easily call local shell commands, external HTTP API endpoints, etc.

How to use it

  1. Install a tool that supports CEDARScript.
  2. Then ask the AI assistant to fix a bug or make some other change in your codebase.

The assistant will write CEDARScript commands that will be executed by the CEDARScript runtime editor.

CEDARScript ELI5'ed

The Magical Librarian analogy

Imagine a vast library (your codebase) with millions of books (files) across thousands of shelves (directories). Traditional code editing is like manually searching through each book, line by line, character by character, to find relevant information or make changes.

CEDARScript, on the other hand, is like having a magical librarian with superpowers, like:

  1. TurboKognition Boost (Code Analysis):
    • This librarian can act as an Omniscient Cataloger who can instantly tell you where any piece of information is located across all books.
    • Want to know every place where a specific protagonist (function) is mentioned, or where he or she was born? Or find all the chapters (classes) that discuss a particular topic (variable usage)? The librarian provides this information immediately, without having to flip through pages (and waste precious tokens).
  2. The GanzPunktGenau Editing Powers (Code Manipulation):
    • When you want to make changes, instead of specifying exact page and line numbers, you can give high-level instructions. For example, "Add this new paragraph after the first mention of 'dragons' in the fantasy section" or "Move the chapter about 'time travel' to come before 'parallel universes' in all science fiction books." The librarian understands these abstract instructions and makes the precise edits across all relevant books, handling details like page layout and consistent formatting.

This magical librarian (CEDARScript) collaborates with the LLM and allows it to assume the role of an Architect who can work with your vast library of code at a higher level, making both understanding and modifying your codebase faster and more intuitive. It bridges the gap between the LLM's high-level intent and the nitty-gritty details of code structure, allowing the architect to focus on the 'what' while it handles the 'how' of code analysis and modification.

Audio overview / Podcasts

There are a few podcasts discussing CEDARScript that you can listen to:

  1. Aider and the CEDARScript Advantage (~18 minutes)
  2. AI coding assistants and the Magical Librarian
  3. CEDARScript's TurboKognition and GanzPunktGenau editing
  4. Discussion of an LLM chat held during a benchmark and some command examples

Technical Overview

CEDARScript (Concise Examination, Development, And Refactoring Script) is a SQL-like language designed to lower costs and improve the efficiency and accuracy of AI code assistants. It offloads low-level syntax and structure concerns, such as indentation and line counting, from the LLM to the runtime. It gives AI-assisted development tools a standardized, concise way to express complex code analysis and modification operations, which makes those operations easier to understand and execute.

CEDARScript transforms LLMs from code writers into code architects.

The Architect doesn't need to specify every tiny detail - instead of spending expensive tokens writing out complete code changes, it simply provides high-level blueprints using CEDARScript commands like UPDATE FILE "main.py" MOVE FUNCTION "execute" INSERT AFTER FUNCTION "plan".

This division of labor between the architect and CEDARScript is not just efficient - it's economical. The Architect (LLM) conserves valuable resources (tokens) by focusing on strategic decisions rather than character- or line-level editing tasks.

The CEDARScript runtime then handles all the minute details - precise line numbers, indentation counts, and syntax consistency - at zero token cost.

Let's get to know the 3 primary functions offered by CEDARScript:

  1. Code Analysis to quickly get to know a large code base without having to read all contents of all files.
    • The CEDARScript runtime searches through the whole code base and only returns the relevant results, thus reducing the token traffic between the LLM and the user;
    • This can be used to more quickly understand key aspects of the codebase, search for all or specific identifiers (classes, methods, functions or variables) defined across ALL files of the project or in specific ones, etc.
    • Search results can include not only identifier definitions (in whole or only the signature or summary), but also call-sites and usages of an identifier;
      • These results can be useful not only when the LLM needs to read them, but also when the LLM wants to show some parts of the code to the user (why send a function to the user if the LLM can simply SELECT it and have the CEDARScript runtime show the contents?)
  2. Code Manipulation and Refactoring:
    • The CEDARScript runtime bears the brunt of file editing by locating the exact line numbers and characters to change, which indentation levels to apply to each line and so on, allowing the CEDARScript commands to focus instead on higher levels of abstraction, like identifier names, line markers, relative indentations and positions (AFTER, BEFORE, INTO a function, its BODY, at the TOP or BOTTOM of it...)
  3. Tool Use: The runtime acts as a gateway through which the LLM can send and receive information. This opens up many possibilities.
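
To make the "runtime handles the line numbers" idea concrete, here is a minimal sketch of how an identifier marker such as FUNCTION "execute" could be resolved to a line range. It uses Python's standard `ast` module purely for illustration; the helper name `find_function_span` is hypothetical and this is not the actual CEDARScript runtime, which relies on Tree-sitter parsers.

```python
import ast


def find_function_span(source: str, name: str) -> tuple[int, int]:
    """Return the 1-based (start, end) line numbers of the first function called `name`.

    Hypothetical helper: decorators are not included in the returned span.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return node.lineno, node.end_lineno
    raise LookupError(f"function {name!r} not found")


SOURCE = '''\
def plan():
    return "plan"

def execute():
    steps = plan()
    return steps
'''

# The LLM only names the function; the line range is computed locally.
print(find_function_span(SOURCE, "execute"))  # (4, 6)
```

The LLM never sees or emits the numbers 4 and 6; it only supplies the identifier, and a lookup like this one fills in the rest.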

Key Features:

  • Learning Curve
    • For humans: its SQL-like syntax allows for intuitive code querying and manipulation. Humans don't need to learn it, though, as its primary purpose is to offer LLMs an easy language in which they can write simple, concise commands to modify or analyse code;
    • For AIs: some prompt engineering is enough to enable most LLMs (even cheaper ones like Gemini Flash) to learn it well. Other forms of fine-tuning are planned, so that even SLMs (Small Language Models) like Microsoft's Phi 3 may be able to learn CEDARScript. This has the potential to unlock locally-deployed SLMs to be used as AI code assistants.
  • Shows improved results in refactoring benchmarks when compared to standard diff formats
  • Reduced token usage via semantic-level code transformations, not character-by-character matching;
    • Scalable to larger codebases with minimal token usage;
    • Project-wide refactorings can be performed with a single, concise command
    • Avoids wasted time and tokens on failed search/replace operations caused by misplaced spaces, indentations or typos;
  • High-level abstractions for complex refactoring operations via refactoring languages (currently supports Rope syntax);
  • Relative indentation for easily maintaining proper code structure;
  • Allows fetching or modifying targeted parts of code;
  • Locations in code: Doesn't use line numbers. Instead, offers more resilient alternatives, like:
    • Line markers. Ex:
      • LINE "if name == 'some name':"
    • Identifier markers (VARIABLE, FUNCTION, CLASS). Ex:
      • FUNCTION 'my_function'
  • Language-agnostic design for versatile code analysis
  • Code analysis operations return results in XML format for easier parsing and processing by LLM (Large Language Model) systems.
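
As an illustration of the line markers and relative indentation described above, the following sketch resolves a LINE marker by ignoring leading and trailing whitespace, then indents new code relative to the matched line. The function names `find_line_marker` and `indent_relative_to`, and the assumption of 4-space indentation, are hypothetical choices for this example and are not taken from the CEDARScript runtime.

```python
def find_line_marker(lines: list[str], marker: str) -> int:
    """Return the 0-based index of the first line whose stripped text equals the marker."""
    target = marker.strip()
    for index, line in enumerate(lines):
        if line.strip() == target:
            return index
    raise LookupError(f"line marker {marker!r} not found")


def indent_relative_to(reference_line: str, new_lines: list[str],
                       relative_level: int, unit: str = "    ") -> list[str]:
    """Indent new_lines `relative_level` levels deeper (or shallower) than reference_line.

    Assumes space-only indentation in multiples of `unit`.
    """
    base_width = len(reference_line) - len(reference_line.lstrip(" "))
    level = max(0, base_width // len(unit) + relative_level)
    prefix = unit * level
    return [prefix + line if line.strip() else line for line in new_lines]


code = [
    "def greet(name):",
    "    if name == 'some name':",
    "        return 'hi'",
    "    return 'hello'",
]

# The marker carries no line number and no indentation.
idx = find_line_marker(code, "if name == 'some name':")
# Insert one level deeper than the marker line, right after it.
code[idx + 1:idx + 1] = indent_relative_to(code[idx], ["print('matched')"], 1)
print("\n".join(code))
```

Because the marker is matched on content, it keeps working when unrelated lines are added or removed elsewhere in the file, which is where absolute line numbers tend to break.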

Supported Languages

Currently, CEDARScript theoretically supports Python, Kotlin, PHP, Rust, Go, C++, C, Java, JavaScript, Lua, FORTRAN, Scala and C#, but only Python has been tested so far.

COBOL and MATLAB: Initial queries for these languages are ready, but the Tree-sitter parsers for them still need to be included.

Projects using the CEDARScript Language

  1. CEDARScript Integration: Aider - Provides CEDARScript edit format for Aider
  2. CEDARScript AST Parser (Python)
  3. CEDARScript Editor
  4. CEDARScript Prompt Engineering
    • Provides prompts that teach CEDARScript to LLMs
    • Also includes real conversations held via Aider in which an LLM uses this language to propose code modifications

How can CEDARScript be used?

Improving LLM <-> codebase interactions

CEDARScript can be used as a way to standardize and improve how AI coding assistants interact with codebases, learn about your code, and communicate their code modification intentions while keeping token usage low. This efficiency allows for more complex operations within token limits.

It provides a concise way to express complex code modification and analysis operations, making it easier for AI-assisted development tools to understand and perform these tasks.

Codebase Interaction Examples

Quick example: turn a method into a top-level function, then update its call sites using a CASE filter with REGEX:

UPDATE FILE "baseconverter.py"
MOVE FUNCTION "convert"
INSERT BEFORE CLASS "BaseConverter"
  RELATIVE INDENTATION 0;

-- Update the call sites in encode() and decode() methods to use the top-level convert() function
UPDATE CLASS "BaseConverter"
  FROM FILE "baseconverter.py"
REPLACE BODY
WITH CASE -- Filter each line in the function body through this CASE filter
  WHEN   REGEX r"self\.convert\((.*?)\)"
  THEN REPLACE r"convert(\1)"
END;
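
For readers less familiar with regex back-references, the effect of the CASE filter above can be approximated in plain Python. This is only a sketch of the per-line substitution, assuming the filter behaves like `re.sub` applied to each line of the body; the helper name `apply_case_filter` and the sample method body are made up for the example.

```python
import re

# Same pattern and replacement as in the CEDARScript command above.
CALL_PATTERN = re.compile(r"self\.convert\((.*?)\)")


def apply_case_filter(body: str) -> str:
    """Rewrite self.convert(...) calls to convert(...) on every line of the body."""
    return "".join(
        CALL_PATTERN.sub(r"convert(\1)", line)
        for line in body.splitlines(keepends=True)
    )


body = (
    "def encode(self, i):\n"
    "    return self.convert(i, self.decimal_digits, self.digits)\n"
)
print(apply_case_filter(body))
```

Lines that do not match the pattern pass through unchanged, so indentation and unrelated statements in the body are preserved.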

Use an ED script to change a function:

UPDATE FILE "app/main.py" REPLACE FUNCTION "calculate_total" WITH ED '''
-- Add type hints to parameters
1s/calculate_total(base_amount, tax_rate, discount, apply_shipping)/calculate_total(base_amount: float, tax_rate: float, discount: float, apply_shipping: bool) -> float/

-- Add docstring after function definition
1a
    """
    Calculate the total amount including tax, shipping, and discount.

    Args:
        base_amount: Base price of the item
        tax_rate: Tax rate as decimal (e.g., 0.1 for 10%)
        discount: Discount as decimal (e.g., 0.2 for 20%)
        apply_shipping: Whether to add shipping cost

    Returns:
        float: Final calculated amount rounded to 2 decimal places
    """
.

-- Add logging before return
/return/i
    logger.info(f"Calculated total amount: {subtotal:.2f}")
.
''';

There are many more examples to look at...

Use as a refactoring language / diff format

One can use CEDARScript to concisely and unambiguously represent code modifications at a higher level than a standard diff format can.

IDEs can store the local history of files in CEDARScript format, and this can also be used for searches.

Tool Use

If explicit configuration is set, the CEDARScript runtime can act as a gateway through which an LLM can:

  1. Call local commands (ls, grep, find, open)
  2. Run scripts
  3. Call external HTTP API services
  4. See the user's screen and take control of the mouse and keyboard
  5. Possibilities are numerous...

The output from the external tool is captured and sent back to the LLM.
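
A rough sketch of what such a gateway could look like for Python snippets is shown below. It runs the code in a subprocess, forwards only an allow-list of host environment variables, and returns the captured output. Everything here is an assumption made for illustration: the function `run_python_tool`, its parameters, and the choice to always pass `PATH` and `SYSTEMROOT` are not part of CEDARScript.

```python
import os
import subprocess
import sys
from typing import Optional

# Hypothetical baseline: variables the interpreter itself may need in order to start.
BASELINE_VARS = ("PATH", "SYSTEMROOT")


def run_python_tool(code: str,
                    inherit_only: tuple = (),
                    extra_env: Optional[dict] = None,
                    timeout: float = 10.0) -> str:
    """Run `code` in a child Python process and return its captured output."""
    allowed = BASELINE_VARS + tuple(inherit_only)
    # Only allow-listed host variables reach the child process.
    env = {key: os.environ[key] for key in allowed if key in os.environ}
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    # On failure, the error text is what the LLM needs to see.
    return result.stdout if result.returncode == 0 else result.stderr


output = run_python_tool(
    "import os; print(os.environ['WORD'].lower().count('r'))",
    extra_env={"WORD": "Refrigerator"},
)
print(output.strip())  # 4
```

Restricting the environment and setting a timeout are two of the simplest guards one might want before letting a model run code locally; a real deployment would likely need stronger isolation than this.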

Tool Use Examples

Run Python scripts to find the correct answer for certain types of problems

-- Suppose the LLM has difficulty counting letters...
-- It can delegate the counting to a Python script:
CALL LANGUAGE "python" WITH CONTENT '''
print("Refrigerator".lower().count('r'))
''';
-- Using env var
CALL LANGUAGE "python"
ENV CONTENT '''WORD=Refrigerator'''
WITH CONTENT '''
import os
print(os.environ['WORD'].lower().count('r'))
''';
-- Using env var from the host computer
CALL LANGUAGE "python"
ENV INHERIT ONLY 'WORD'
WITH CONTENT '''
import os
print(os.environ['WORD'].lower().count('r'))
''';

Obtain the current local weather

CALL COMMAND
ENV INHERIT ONLY 'LOCATION' -- Get the current location from the host env var
WITH CONTENT r'''
#!/bin/bash
curl -s "wttr.in/$LOCATION?format=%l:+%C+%t,+feels+like+%f,+%h+humidity"
''';

Get a list of image files in the current working dir

CALL LANGUAGE "bash"
WITH CONTENT r'''
    find . -type f -name "*.jpg"
''';

Take a peek at the user's screen and right-click on the user's clock widget

CALL LANGUAGE "python"
WITH CONTENT r'''
import pyautogui
from datetime import datetime

# Take screenshot and save it
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
screenshot_path = f"screen_{timestamp}.png"
pyautogui.screenshot(screenshot_path)

# Print the path so the LLM can analyze the image
print(f"IMAGE_PATH={screenshot_path}")
''';

After the LLM takes a look at the screenshot, it finds the clock and sends a mouse click:

CALL LANGUAGE "python"
ENV r'''
X=1850  # Coordinates provided by LLM after image analysis
Y=12    # Coordinates provided by LLM after image analysis
'''
WITH CONTENT r'''
import pyautogui
import os

# Get coordinates from environment
x = int(os.environ['X'])
y = int(os.environ['Y'])

# Move and right-click
pyautogui.moveTo(x, y, duration=1.0)
pyautogui.click(button='right')
print(f"Right-clicked at ({x}, {y})")
''';

Other Ideas to Explore

  • Code review systems for automated, in-depth code assessments
  • Automated code documentation and explanation tools
  • ...

Proposals

See current proposals

Related

  1. .QL - Object-oriented query language that enables querying Java source code using SQL-like syntax;
  2. JQL (Java Query Language) - Allows querying Java source code with SQL. It's designed for Java code analysis and linting;
  3. Joern - While primarily focused on C/C++, Joern is an open-source code analysis platform that uses a custom graph database to store code property graphs. It allows querying code using a Scala-based domain-specific language;
  4. Codebase Context Suite - A comprehensive tool for managing codebase context, generating prompts, and enhancing development workflows;
  5. CONVENTIONS.md

See Also

  1. OpenAI Fine-tuning
  2. llm-context.py
  3. Gemini 1.5 Pro improved performance (on par with Sonnet 3.5)

Unrelated

  1. Cedar Policy Language ('CEDARScript' is not a policy language. 'Cedar' and 'CEDARScript' are totally unrelated.)
