Deterministic code compression and indexing for Python repositories
## Why LLMs Struggle With Raw Codebases
LLMs work poorly on real code not because the models are weak, but because code is optimized for human readability, not transformer efficiency.
This mismatch produces predictable failures.
### 1. Raw Code Wastes Tokens and Buries Meaning
Code encodes intent through formatting, naming, indentation, imports, decorators, boilerplate, and syntactic rituals that carry zero semantic weight to a model.
LLMs must ingest all of this noise token-by-token before they can reach the behavior that actually matters.
#### The Problem

```python
# What the LLM sees (high token count, low signal)
def get_user_profile_from_database_by_id(user_id: str) -> UserProfile:
    """
    Retrieves a user profile from the database given a user ID.

    Args:
        user_id: The unique identifier for the user

    Returns:
        UserProfile object containing user data
    """
    if user_id is None:
        raise ValueError("user_id cannot be None")
    # ... 50+ lines of boilerplate
```

```
# What actually matters (Behavior IR)
FN USRP C=DBQY,VALD F=EIR A=3 #DB #CORE
```

→ A function that calls DBQY and VALD, raises exceptions, has conditionals, and returns a value, with 3 assignments. That's the whole behavioral surface.
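A summary line like the one above can be derived mechanically from a function's AST. The sketch below mirrors the field layout shown in the examples (C=calls, F=flags, A=assignments), but the exact encoding — name codes, flag letters, tags — is an assumption for illustration, not CodeIR's published spec.

```python
import ast

def behavior_ir(source: str) -> str:
    """Derive a Behavior-IR-style summary line from one function's source.

    Field layout (C=calls, F=flags, A=assignments) follows the examples
    above; the real CodeIR encoding is assumed, not reproduced.
    """
    fn = ast.parse(source).body[0]                # the (async) function def
    calls, flags, assigns = [], set(), 0
    for node in ast.walk(fn):
        if isinstance(node, ast.Call):
            calls.append(ast.unparse(node.func))  # C= call targets
        elif isinstance(node, ast.Raise):
            flags.add("E")                        # raises exceptions
        elif isinstance(node, ast.If):
            flags.add("I")                        # has conditionals
        elif isinstance(node, ast.Return) and node.value is not None:
            flags.add("R")                        # returns a value
        elif isinstance(node, (ast.Assign, ast.AnnAssign)):
            assigns += 1
    kind = "AMT" if isinstance(fn, ast.AsyncFunctionDef) else "FN"
    return (f"{kind} {fn.name.upper()} C={','.join(calls)} "
            f"F={''.join(sorted(flags))} A={assigns}")

print(behavior_ir(
    "def get_user(uid):\n"
    "    if uid is None:\n"
    "        raise ValueError('no id')\n"
    "    row = db_query(uid)\n"
    "    return row\n"
))
# → FN GET_USER C=db_query,ValueError F=EIR A=1
```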
### 2. Context Limits Break Multi-File Reasoning
Large repositories exceed model context windows, which forces the model to reason over fragments.
Without global visibility, it cannot maintain stable understanding of:
- Call relationships — who calls whom?
- Shared invariants — what assumptions cross file boundaries?
- Cross-file constraints — which changes break what?
- Architectural intent — what's the actual design?
CodeIR's bearings file provides module-level architecture in ~200-400 tokens, and Behavior-level IR fits entire codebases in context where raw source cannot.
### 3. Redundant Variation Inflates Complexity
Equivalent constructs written in different styles look unrelated unless normalized:
| Python | JavaScript | Swift |
|---|---|---|
| get_user() | fetchUser() | retrieveUser() |
| user_data | userData | userInfo |
| db_query() | queryDB() | databaseQuery() |
LLMs treat syntactically different expressions of the same idea as separate concepts, fragmenting reasoning.
#### After Compression

Stable entity IDs normalize naming:

```
FN USRG → "get user"         (regardless of casing/style)
FN DBQY → "database query"   (regardless of language)
```
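A minimal sketch of that normalization: split identifiers across casing conventions into their component words, then map synonyms onto one canonical term. The synonym table here is illustrative, not CodeIR's actual vocabulary.

```python
import re

# Illustrative synonym table; CodeIR's real vocabulary is not shown here.
SYNONYMS = {"fetch": "get", "retrieve": "get", "db": "database"}

def normalize(identifier: str) -> list[str]:
    """Split snake_case / camelCase / PascalCase into canonical lowercase words."""
    words = []
    for chunk in identifier.split("_"):
        # lowercase runs, all-caps acronyms, or digit runs
        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk)
    return [SYNONYMS.get(w.lower(), w.lower()) for w in words]

print(normalize("get_user"))      # → ['get', 'user']
print(normalize("fetchUser"))     # → ['get', 'user']
print(normalize("retrieveUser"))  # → ['get', 'user']
```

All three spellings collapse to the same word sequence, which is what lets a stable ID like `USRG` stand in for every variant.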
### 4. Structural Information Is Implicit, Not Explicit
Architectural boundaries are buried in syntax:
- Stateful regions
- Async boundaries
- Platform-specific logic
- Critical error paths
LLMs see sequential text, not structure, unless forced into a better representation.
#### Example: Behavior IR Makes Structure Explicit

```python
# LLM sees: "just another function"
async def process_payment(order_id):
    result = await db.query(...)
    if not result:
        raise PaymentError("not found")
    return result
```

```
# Behavior IR surfaces the structure
AMT PRCSPYMNT C=PaymentError,db.query F=AEIR A=1 #DB #CORE
```

→ An async method that awaits, raises, has conditionals, and returns. The async boundary, error path, and DB dependency are all explicit.
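The flag field in these lines can be read off with a tiny decoder. The letter-to-meaning mapping below is inferred from the examples in this document and is an assumption, not the published spec.

```python
# Assumed flag mnemonics, inferred from the examples in this document.
FLAGS = {
    "A": "async/awaits",
    "E": "raises exceptions",
    "I": "has conditionals",
    "R": "returns a value",
}

def decode_flags(flag_field: str) -> list[str]:
    """Expand a field like 'F=AEIR' into its per-letter descriptions."""
    return [FLAGS[ch] for ch in flag_field.removeprefix("F=")]

print(decode_flags("F=AEIR"))
# → ['async/awaits', 'raises exceptions', 'has conditionals', 'returns a value']
```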
### 5. Token-Heavy Regions Distort Importance
Large helper functions, repeated boilerplate, and verbose patterns dominate attention even when they are unimportant.
The model cannot prioritize high-impact architectural nodes.
#### The Attention Problem

```python
# 300 lines of logging boilerplate
def setup_logging_configuration():
    ...  # consumes 800+ tokens

# 5 lines of critical business logic
def validate_payment():
    ...  # only 50 tokens, but this is what matters
```

The LLM spends most of its attention on noise, not signal.
#### With IR Compression

```
FN STPLG F=A A=15 #CORE                  # 8 tokens
FN VLDPYMNT C=BNKP,FRDCHK F=EIR #CORE    # 12 tokens
```

Equal representation regardless of verbosity. The model sees both at the same resolution.
## Why Semantic Compression Matters
Semantic compression makes structure explicit and collapses unnecessary variation, allowing LLMs to operate where they're strongest:
| Capability | Raw Code | With CodeIR |
|---|---|---|
| Pattern detection | Fragmented by syntax | Normalized and clear |
| Architectural reasoning | Implicit, buried | Explicit, structured |
| Relational understanding | Context-limited | Graph-based, complete |
| Token efficiency | Full source token count | Behavior IR at ~3-5% of source |
Instead of parsing noise, the model operates on a consistent, low-entropy substrate.
## The CodeIR Approach
IR is the operating system. Raw code is just a UI.
By transforming code into a deterministic, compressed, structure-first representation, we give LLMs the substrate they need to reason effectively about real-world software.
## Related Documentation
- Main README — Project overview
- IR Spec (As Built) — Technical details
- Future Considerations — Planned expansions
## File details

Details for the file `codeir_tools-0.2.0.tar.gz`.

**File metadata**

- Download URL: codeir_tools-0.2.0.tar.gz
- Upload date:
- Size: 54.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.13

**File hashes**

| Algorithm | Hash digest |
|---|---|
| SHA256 | `005a64519a0c973b4111c748561b7ee61bb527e31e9c25af3cb620f1448e4253` |
| MD5 | `99d7ca2dd2a6e7c73cf7b6b1a22db94b` |
| BLAKE2b-256 | `57a4cd3b50397c85f1d0e72d9ba5c7f661fafcc8e922031b9aa7b7b97238ef87` |
## File details

Details for the file `codeir_tools-0.2.0-py3-none-any.whl`.

**File metadata**

- Download URL: codeir_tools-0.2.0-py3-none-any.whl
- Upload date:
- Size: 57.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.13

**File hashes**

| Algorithm | Hash digest |
|---|---|
| SHA256 | `04ece9d4d910991777905cfa892de704eed4f5d80e1d21344e305165c5b8764d` |
| MD5 | `8959e5a0bd347603571827ccc4135661` |
| BLAKE2b-256 | `22bd0aa723c0bfaf8767914c3b566695ab93cfeb5021bc051a21abf569f7cb80` |