The official Python SDK for Codellm-Devkit.
Project description
Codellm-Devkit (CLDK)
A unified, multilingual program-analysis SDK for Code LLMs. CLDK turns raw source code into structured, LLM-ready program facts — symbol tables, call graphs, type hierarchies, and more — behind a single Python API, so you can build analysis-augmented LLM pipelines without wrangling a different static-analysis tool for every language.
Under the hood, CLDK orchestrates mature analysis engines (WALA, Tree-sitter, Jedi, CodeQL, ts-morph) and normalizes their output into consistent, typed Pydantic models. You get the same ergonomic interface whether you are analyzing Java, Python, or TypeScript.
CLDK is:
- Unified — one framework and one mental model across languages and analysis backends.
- Extensible — designed to take on new languages, engines, and graph backends (e.g. Neo4j).
- Streamlined — raw code in, structured LLM-ready facts out, with the tooling complexity hidden.
Developed at IBM Research. CLDK is an actively evolving project — issues and contributions are welcome.
Cited By
CLDK (Krishna et al., 2024) is used and cited in a growing body of research on program analysis and code LLMs:
- RECON: An LLM-Enhanced Backward Constraint Analysis Framework — Bappah et al. (2026). arXiv:2606.10264
- Architecting Open, Accountable, and Trustworthy AI-IDEs — Contreras, Guerra & de Lara (2026). Automated Software Engineering. doi:10.1007/s10515-026-00608-x
- Resolving Java Code Repository Issues with iSWE Agent — Ganhotra et al. (2026). arXiv:2603.11356
- HookLens: Visual Analytics for Understanding React Hooks Structures — Hwang et al. (2026). IEEE PacificVis. arXiv:2602.17891
- PRAXIS: Integrating Program Analysis with Observability for Root-Cause Analysis — Cui, Krishna & Jha et al. (2025). arXiv:2512.22113
- Examining Software Developers' Needs for Privacy Enforcing Techniques: A Survey — Theophilou & Kapitsaki (2025). ACM SAC. arXiv:2512.14756
- LLM as an Execution Estimator: Recovering Missing Dependency for Practical Time-Travelling Debugging — Pei, Wang & Zhang et al. (2025). arXiv:2508.18721
- Agentic Multi-Modal LLMs for Software Comprehension: Structuring Code Summarization with Business Process Awareness — Tamilselvam & Saxena (2025). IEEE SSE. doi:10.1109/SSE67621.2025.00024
- Phaedrus: Predicting Dynamic Application Behavior with Lightweight Generative Models and LLMs — Chatterjee, Jadhav & Pande (2024). PACMPL (OOPSLA). arXiv:2412.06994
List compiled from Semantic Scholar / OpenAlex citation data; please open a PR to add a missing paper.
Table of Contents
- Cited By
- Installation
- Quick Start
- Supported Languages & Backends
- Architecture
- Documentation
- Contributing
- Citation
- Maintainers
Installation
pip install cldk
Optional extras:
pip install "cldk[neo4j]" # read-only Neo4j graph backend (Java / Python / TypeScript)
Quick Start
Create a language-specific analysis facade with the per-language factory methods, then query it:
from cldk import CLDK
# Pick a language — each returns a typed analysis facade.
analysis = CLDK.java(project_path="/path/to/java/project")
# analysis = CLDK.python(project_path="/path/to/python/project")
# analysis = CLDK.typescript(project_path="/path/to/ts/project")
Walk the symbol table and pull method bodies:
from cldk import CLDK
analysis = CLDK.java(project_path="/path/to/java/project")
for file_path, class_file in analysis.get_symbol_table().items():
for type_name, type_declaration in class_file.type_declarations.items():
for method in type_declaration.callable_declarations.values():
body = analysis.get_method_body(method.declaration)
print(f"{type_name}.{method.declaration}\n{body}\n")
Build a call graph by raising the analysis level:
from cldk import CLDK
from cldk.analysis import AnalysisLevel
analysis = CLDK.python(
project_path="/path/to/python/project",
analysis_level=AnalysisLevel.call_graph,
)
call_graph = analysis.get_call_graph() # a networkx.DiGraph
Select a backend by passing a typed config. For example, query a pre-populated graph read-only over Neo4j (no source or analyzer run needed):
from cldk import CLDK
from cldk.analysis.commons.backend_config import Neo4jConnectionConfig
analysis = CLDK.python(
backend=Neo4jConnectionConfig(
uri="bolt://localhost:7687",
application_name="my-app", # the graph is populated out of band
),
)
classes = analysis.get_all_classes()
Deprecation: the old
CLDK(language="java").analysis(...)entry point still works as a thin compatibility shim (it emits aDeprecationWarning). Prefer theCLDK.java()/CLDK.python()/CLDK.typescript()factory methods.
Supported Languages & Backends
Each language is analyzed by a dedicated codeanalyzer-* engine; CLDK normalizes the result into typed models exposed through the same API. All three also support an optional read-only Neo4j backend — pass a Neo4jConnectionConfig and the SDK answers the same queries with Cypher over a graph the analyzer populates out of band (--emit neo4j).
| Language | Analysis engine | What it provides |
|---|---|---|
| Java | codeanalyzer-java |
WALA + JavaParser. Bytecode-level call graphs, type hierarchies, symbol resolution, CRUD-operation and entry-point detection. Optional read-only Neo4j graph backend. |
| Python | codeanalyzer-python |
Jedi with optional CodeQL augmentation. Symbol tables, call graphs, and class/method resolution. Optional read-only Neo4j graph backend. |
| TypeScript / JavaScript | codeanalyzer-typescript |
ts-morph with Jelly-based call graphs. Symbols, call graph, types, decorators, and call sites. Optional read-only Neo4j graph backend. |
The backend is selected by the type of the backend= config you pass to a factory: the in-process analyzer (default) or a Neo4jConnectionConfig for the read-only graph backend.
Analysis cache (Python): caching is owned by
codeanalyzer-python— the backend virtualenv, CodeQL database, and analysis cache live undercache_dir(default<project>/.codeanalyzer). CodeQL is on by default, so the first run is slow (it provisions a CodeQL DB) and later runs reuse a checksum-validated cache. Add the cache directory to your.gitignore.
Architecture
The user interacts only with the top-level CLDK interface (core.py), which configures the session, initializes the language-specific pipeline, and exposes a high-level, language-agnostic API. Each language module is built from two pieces: data models and an analysis backend.
graph TD
User <--> CLDK
CLDK --> M[cldk.models<br/>typed Pydantic schemas]
CLDK --> A[cldk.analysis]
A --> J[cldk.analysis.java]
A --> P[cldk.analysis.python]
A --> T[cldk.analysis.typescript]
J --> EJ[codeanalyzer-java<br/>WALA · JavaParser]
P --> EP[codeanalyzer-python<br/>Jedi · CodeQL]
T --> ET[codeanalyzer-typescript<br/>ts-morph · Jelly]
J -. read-only .-> N[(Neo4j)]
P -. read-only .-> N
T -. read-only .-> N
Data models — each language has its own set of Pydantic models under cldk.models (cldk.models.java, cldk.models.python, cldk.models.typescript). They give you structured, typed, dot-accessible representations of classes, methods, fields, and statements, with JSON serialization and shared conventions across languages.
Analysis backends — each language has a backend under cldk.analysis.<language> that coordinates its engine (see the table above) and maps the result onto the data models. The read-only Neo4j backends (cldk.analysis.<language>.neo4j) reconstruct the same models from a Cypher graph, so they are drop-in interchangeable with the in-process analyzers. Backends are orchestrated internally; you only call high-level methods such as get_symbol_table(), get_method_body(...), and get_call_graph(...), and CLDK handles tool coordination, parsing, and marshalling under the hood.
Documentation
Full documentation lives at codellm-devkit.info.
Contributing
We welcome contributors of all experience levels — see the CONTRIBUTING guide to get started.
Citation
If you use CLDK in your research, please cite:
@article{krishna2024codellm,
title = {Codellm-Devkit: A Framework for Contextualizing Code LLMs with Program Analysis Insights},
author = {Krishna, Rahul and Pan, Rangeet and Pavuluri, Raju and Tamilselvam, Srikanth and Vukovic, Maja and Sinha, Saurabh},
journal = {arXiv preprint arXiv:2410.13007},
year = {2024}
}
Related publications:
- Pan, Rangeet, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. "Multi-language Unit Test Generation using LLMs." arXiv preprint arXiv:2409.03093 (2024).
- Pan, Rangeet, Rahul Krishna, Raju Pavuluri, Saurabh Sinha, and Maja Vukovic. "Simplify your Code LLM solutions using CodeLLM Dev Kit (CLDK)." Blog.
Maintainers
| Name | |
|---|---|
| Rahul Krishna | i.m.ralk@gmail.com |
| Rangeet Pan | rangeet.pan@ibm.com |
| Saurabh Sinha | sinhas@us.ibm.com |
Licensed under the Apache License 2.0.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cldk-1.2.0.tar.gz.
File metadata
- Download URL: cldk-1.2.0.tar.gz
- Upload date:
- Size: 128.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.23 {"installer":{"name":"uv","version":"0.11.23","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bbf4ceb9cb2fa9ef6cbbff8098dc03ab8f8e6360506f13f2983ed00fd5130971
|
|
| MD5 |
2d7c5048749d01b2107a146146f4d4c0
|
|
| BLAKE2b-256 |
92de05f17caf4fd430c83c3412a9c46b30c1fd45691b4080e4e7d2d019d5cce9
|
File details
Details for the file cldk-1.2.0-py3-none-any.whl.
File metadata
- Download URL: cldk-1.2.0-py3-none-any.whl
- Upload date:
- Size: 168.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.23 {"installer":{"name":"uv","version":"0.11.23","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a281cbf11a28193672059dd6235b172abc4df433a1697c269478907a776ccde9
|
|
| MD5 |
b4e571c70f06c0ce378f20a9942bdaf2
|
|
| BLAKE2b-256 |
a5a732e09c188448c65f375470198ef7daddf4d4b2727a7b8daa007af5e70d9d
|