
Deterministic code compression and indexing for Python repositories


Why LLMs Struggle With Raw Codebases

LLMs work poorly on real code not because the models are weak, but because code is optimized for human readability, not transformer efficiency.

This mismatch produces predictable failures.


1. Raw Code Wastes Tokens and Buries Meaning

Code encodes intent through formatting, naming, indentation, imports, decorators, boilerplate, and syntactic rituals that carry zero semantic weight to a model.

LLMs must ingest all of this noise token-by-token before they can reach the behavior that actually matters.

The Problem

# What the LLM sees (high token count, low signal)
def get_user_profile_from_database_by_id(user_id: str) -> UserProfile:
    """
    Retrieves a user profile from the database given a user ID.

    Args:
        user_id: The unique identifier for the user

    Returns:
        UserProfile object containing user data
    """
    if user_id is None:
        raise ValueError("user_id cannot be None")

    # ... 50+ lines of boilerplate

# What actually matters (Behavior IR)
FN USRP C=DBQY,VALD F=EIR A=3 #DB #CORE

→ A function (FN) that calls DBQY and VALD, raises exceptions (E), has conditionals (I), and returns a value (R), with 3 assignments (A=3). That's the whole behavioral surface.
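The IR line above follows a simple token grammar: kind, entity ID, C= calls, F= flags, A= assignment count, # tags. The exact grammar is internal to CodeIR, so the parser below is a sketch under that assumption, not the official reader:

```python
from dataclasses import dataclass, field

@dataclass
class IREntity:
    kind: str                              # e.g. FN (function), AMT (async method)
    name: str                              # stable entity ID
    calls: list = field(default_factory=list)
    flags: str = ""                        # E=raises, I=conditionals, R=returns
    assigns: int = 0
    tags: list = field(default_factory=list)

def parse_ir_line(line: str) -> IREntity:
    """Parse one Behavior IR line into a structured record."""
    tokens = line.split()
    ent = IREntity(kind=tokens[0], name=tokens[1])
    for tok in tokens[2:]:
        if tok.startswith("C="):
            ent.calls = tok[2:].split(",")
        elif tok.startswith("F="):
            ent.flags = tok[2:]
        elif tok.startswith("A="):
            ent.assigns = int(tok[2:])
        elif tok.startswith("#"):
            ent.tags.append(tok[1:])
    return ent

ent = parse_ir_line("FN USRP C=DBQY,VALD F=EIR A=3 #DB #CORE")
print(ent.calls)    # ['DBQY', 'VALD']
print(ent.assigns)  # 3
```

Once parsed this way, a whole repository's IR becomes a list of uniform records that downstream tooling (or a model prompt builder) can filter by tag, flag, or call target.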


2. Context Limits Break Multi-File Reasoning

Large repositories exceed model context windows, which forces the model to reason over fragments.

Without global visibility, it cannot maintain stable understanding of:

  • Call relationships — who calls whom?
  • Shared invariants — what assumptions cross file boundaries?
  • Cross-file constraints — which changes break what?
  • Architectural intent — what's the actual design?

CodeIR's bearings file provides module-level architecture in ~200-400 tokens, and Behavior-level IR fits entire codebases in context where raw source cannot.
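A back-of-envelope check makes the point concrete. Assuming the ~3-5% compression ratio this document cites for Behavior IR (the exact figure varies by codebase), a repository far larger than any context window compresses to something that fits:

```python
def fits_in_context(source_tokens: int, context_window: int,
                    compression_ratio: float = 0.04) -> bool:
    """Rough check: does the compressed IR fit the context window?

    compression_ratio defaults to 4%, inside the ~3-5% Behavior IR
    range cited here; real ratios vary by codebase.
    """
    return source_tokens * compression_ratio <= context_window

# A 2M-token repository overflows a 128k-token window as raw source:
print(2_000_000 <= 128_000)                 # False
# but its Behavior IR (~80k tokens at 4%) fits with room to spare:
print(fits_in_context(2_000_000, 128_000))  # True
```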


3. Redundant Variation Inflates Complexity

Equivalent constructs written in different styles look unrelated unless normalized:

Python          JavaScript        Swift
get_user()      fetchUser()       retrieveUser()
user_data       userData          userInfo
db_query()      queryDB()         databaseQuery()

LLMs treat syntactically different expressions of the same idea as separate concepts, fragmenting reasoning.

After Compression

Stable entity IDs normalize naming:

FN USRG → "get user"         (regardless of casing/style)
FN DBQY → "database query"   (regardless of language)
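One way to get stable IDs is to canonicalize naming style before assigning an entity ID. The sketch below splits camelCase/snake_case and folds a synonym table; the table is hypothetical, and CodeIR's real ID scheme is deterministic but its internals are not shown here:

```python
import re

# Hypothetical synonym table for illustration only.
SYNONYMS = {"fetch": "get", "retrieve": "get", "info": "data"}

def canonical_key(identifier: str) -> str:
    """Split camelCase/snake_case, lowercase, and fold synonyms."""
    words = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", identifier).lower().split("_")
    return " ".join(SYNONYMS.get(w, w) for w in words if w)

# All three style variants collapse to one concept:
print(canonical_key("get_user"))      # get user
print(canonical_key("fetchUser"))     # get user
print(canonical_key("retrieveUser"))  # get user
```

Hashing the canonical key would then yield the short stable ID (e.g. USRG) that every style variant maps to.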

4. Structural Information Is Implicit, Not Explicit

Architectural boundaries are buried in syntax:

  • Stateful regions
  • Async boundaries
  • Platform-specific logic
  • Critical error paths

LLMs see sequential text, not structure, unless forced into a better representation.

Example: Behavior IR Makes Structure Explicit

# LLM sees: "just another function"
async def process_payment(order_id):
    result = await db.query(...)
    if not result:
        raise PaymentError("not found")
    return result

# Behavior IR surfaces the structure
AMT PRCSPYMNT C=PaymentError,db.query F=AEIR A=1 #DB #CORE

→ An async method (AMT) that awaits (A), raises (E), has conditionals (I), and returns (R). The async boundary, error path, and DB dependency are all explicit.
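Flags like these can be derived mechanically from the syntax tree. The sketch below uses Python's `ast` module to recover A/E/I/R flags from the example function; the real extractor and its full flag set may differ:

```python
import ast

def behavior_flags(source: str) -> str:
    """Derive flags from a function's AST: A(sync), E(raises),
    I(conditionals), R(eturns). A sketch of the kind of analysis
    Behavior IR performs, not the actual implementation."""
    fn = ast.parse(source).body[0]
    flags = ""
    if isinstance(fn, ast.AsyncFunctionDef):
        flags += "A"
    if any(isinstance(n, ast.Raise) for n in ast.walk(fn)):
        flags += "E"
    if any(isinstance(n, ast.If) for n in ast.walk(fn)):
        flags += "I"
    if any(isinstance(n, ast.Return) for n in ast.walk(fn)):
        flags += "R"
    return flags

src = '''async def process_payment(order_id):
    result = await db.query(order_id)
    if not result:
        raise PaymentError("not found")
    return result
'''
print(behavior_flags(src))  # AEIR
```

Because the analysis only parses (never executes) the source, undefined names like `db` or `PaymentError` in the snippet are harmless.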


5. Token-Heavy Regions Distort Importance

Large helper functions, repeated boilerplate, and verbose patterns dominate attention even when they are unimportant.

The model cannot prioritize high-impact architectural nodes.

The Attention Problem

# 300 lines of logging boilerplate
def setup_logging_configuration():
    ...  # consumes 800+ tokens

# 5 lines of critical business logic
def validate_payment():
    ...  # only 50 tokens, but this is what matters

The LLM spends most of its attention on noise, not signal.

With IR Compression

FN STPLG F=A A=15 #CORE              # 8 tokens
FN VLDPYMNT C=BNKP,FRDCHK F=EIR #CORE  # 12 tokens

Equal representation regardless of verbosity. The model sees both at the same resolution.
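The effect on the attention budget is easy to quantify. Using the token counts from the example above, the critical `validate_payment` entity goes from a sliver of the context to a comparable share (names and counts come from the illustration, not from measurement):

```python
def token_share(token_counts: dict) -> dict:
    """Fraction of the total context each entity consumes."""
    total = sum(token_counts.values())
    return {name: n / total for name, n in token_counts.items()}

# Raw source: boilerplate dominates (token counts from the example above).
raw = token_share({"setup_logging": 800, "validate_payment": 50})
# Behavior IR: both entities sit at comparable resolution.
ir = token_share({"setup_logging": 8, "validate_payment": 12})

print(f"{raw['validate_payment']:.0%}")  # 6% of the budget as raw source
print(f"{ir['validate_payment']:.0%}")   # 60% of the budget as IR
```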


Why Semantic Compression Matters

Semantic compression makes structure explicit and collapses unnecessary variation, allowing LLMs to operate where they're strongest:

Capability                 Raw Code               With CodeIR
Pattern detection          Fragmented by syntax   Normalized and clear
Architectural reasoning    Implicit, buried       Explicit, structured
Relational understanding   Context-limited        Graph-based, complete
Token efficiency           ~100x more tokens      Behavior IR at ~3-5% of source

Instead of parsing noise, the model operates on a consistent, low-entropy substrate.


The CodeIR Approach

IR is the operating system. Raw code is just a UI.

By transforming code into a deterministic, compressed, structure-first representation, we give LLMs the substrate they need to reason effectively about real-world software.



