nl-code
Primitives for research into LLMs and code generation. Provides dataset loading, code execution (with Docker isolation), code analysis, and a dataset explorer UI.
Install
uv add nl-code # core
uv add "nl-code[docker]" # + Docker execution via dr-docker
uv add "nl-code[bigcodebench]" # + scientific libs for BigCodeBench/ClassEval
Code Execution
Execute generated code in isolated Docker containers.
Three execution modes cover all supported dataset test formats:
- function_call — call a named function with inputs, compare return values (HumanEval)
- assertion — exec code + assertion-based test code (HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro)
- unittest — exec code + unittest.TestCase classes (ClassEval)
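As a rough picture of what the function_call mode checks, here is a minimal in-process sketch. It is not the package's API: run_function_call is a hypothetical helper, and it uses plain exec in the current process instead of a Docker container, so it should never be pointed at untrusted code.

```python
from typing import Any


def run_function_call(
    code: str,
    func_name: str,
    cases: list[tuple[tuple[Any, ...], Any]],
) -> list[bool]:
    """Hypothetical helper: exec `code`, call `func_name` per case, compare results."""
    namespace: dict[str, Any] = {}
    exec(code, namespace)  # no isolation here; the real pipeline runs in Docker
    func = namespace[func_name]
    results = []
    for args, expected in cases:
        try:
            results.append(func(*args) == expected)
        except Exception:
            # A candidate that raises counts as a failed case.
            results.append(False)
    return results


candidate = "def add(a, b):\n    return a + b\n"
outcome = run_function_call(candidate, "add", [((1, 2), 3), ((2, 2), 5)])
print(outcome)  # [True, False]
```

The assertion and unittest modes differ mainly in what is executed after the candidate code: a block of assert statements or a unittest.TestCase class, respectively.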
Batch variants (batch_run_test_cases, batch_run_assertion_tests, batch_run_unittest_tests) process many code samples in a single container with auto-chunking.
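The idea behind chunking a batch can be sketched as follows. The README does not say how the package chooses chunk sizes, so chunked and its chunk_size parameter are assumptions made for illustration only.

```python
from collections.abc import Iterator, Sequence


def chunked(samples: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Hypothetical helper: yield consecutive slices of at most `chunk_size` samples."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


# Each chunk would be handed to one container invocation.
chunks = list(chunked(["s0", "s1", "s2", "s3", "s4"], chunk_size=2))
print(chunks)  # [['s0', 's1'], ['s2', 's3'], ['s4']]
```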
Build The Docker Image
Build the execution image from the repo root:
docker build -t nl-code/code-eval-scientific:v1 -f docker/scientific.Dockerfile .
This is the default runtime image used by the execution pipeline. The Dockerfile
installs both the bigcodebench dependency set and the pinned dr-docker
runtime dependency directly from pyproject.toml, so the image stays aligned
with the repo's declared execution requirements.
Run The Docker Test Tier
Docker-dependent tests are marked with @pytest.mark.docker and are excluded
from the default pytest run.
Run them explicitly with:
uv run nl-code-test docker
You can pass extra pytest arguments through after docker, for example:
uv run nl-code-test docker -q tests/test_execution_runner.py
Datasets
Loaders for HumanEval, HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro, and ClassEval. Datasets are fetched from HuggingFace, parsed into Task objects, and cached locally.
Derived Task objects use schema version v3:
- target: TaskTarget with name and kind ("function" or "class")
- source: TaskSource with runnable ground-truth code in source.code
Raw task models preserve the original dataset inputs in a nested source object. Derived artifacts such as ground-truth code, parsed test suites, and official prompts are exposed as @cached_property helpers (gt_solution, test_suite, prompts, and family-specific views) and are not serialized into cache payloads.
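The pattern of keeping derived artifacts out of serialized payloads can be illustrated with standard-library dataclasses as a stand-in for the real models. The class and field names below (RawSource, RawTask, prompt, canonical_solution) are hypothetical and chosen only for the example; only gt_solution is a name taken from the text above.

```python
from dataclasses import asdict, dataclass
from functools import cached_property


@dataclass
class RawSource:
    # Hypothetical raw dataset inputs.
    prompt: str
    canonical_solution: str


@dataclass
class RawTask:
    task_id: str
    source: RawSource

    @cached_property
    def gt_solution(self) -> str:
        # Derived on first access and cached on the instance; not a declared field.
        return self.source.prompt + self.source.canonical_solution


task = RawTask("demo/0", RawSource("def f():\n", "    return 1\n"))
solution = task.gt_solution  # computed once, reused afterwards
payload = asdict(task)  # only declared fields are serialized
print(sorted(payload))  # ['source', 'task_id']
```

Because the derived value is recomputed from the raw source on demand, a cache written this way stays valid even if the derivation logic changes.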
DatasetSlice supports filtering, seeded shuffling, limits, and accessors for common artifacts:
- get_source_code(task_id): normalized runnable code from the derived Task
- get_official_prompt(task_id): dataset-specific official prompt (HumanEval returns the raw HuggingFace prompt)
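Filtering, seeded shuffling, and limits compose in a fixed order. A minimal sketch over plain task IDs follows; take_slice and its parameters are hypothetical and do not mirror the DatasetSlice signature, which is not shown here.

```python
from __future__ import annotations

import random
from collections.abc import Callable


def take_slice(
    task_ids: list[str],
    *,
    keep: Callable[[str], bool] | None = None,
    seed: int | None = None,
    limit: int | None = None,
) -> list[str]:
    """Hypothetical helper: filter, then shuffle deterministically, then truncate."""
    selected = [tid for tid in task_ids if keep is None or keep(tid)]
    if seed is not None:
        # A private Random instance keeps the global RNG state untouched.
        random.Random(seed).shuffle(selected)
    if limit is not None:
        selected = selected[:limit]
    return selected


ids = [f"demo/{i}" for i in range(10)]
first = take_slice(ids, keep=lambda t: not t.endswith("/3"), seed=7, limit=4)
second = take_slice(ids, keep=lambda t: not t.endswith("/3"), seed=7, limit=4)
print(first == second)  # True
```

Shuffling before applying the limit means the same seed always yields the same subset, which makes sampled runs repeatable.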
Parsed dataset caches use schema version 3. Rebuild after upgrading:
uv run python -m nl_code.datasets.cache_cli rebuild all
Dataset Explorer
A FastAPI + React app for browsing and comparing datasets. Run from ui/dataset-explorer/.
HumanEval DSPy Experiments
This branch adds a small DSPy evaluation workflow for comparing direct code generation against an encoder-decoder setup on HumanEval.
- scripts/humaneval_dspy_eval.py runs the evaluation from the command line. It writes a run JSON plus generation-history JSONL records under logs/. ENCDEC eval uses raw.code_stub (full prompt with docstrings) as the default encoder input; pass --encoder-input oracle to feed raw.gt_solution.code for oracle round-trip checks. The decoder always receives raw.function_stub, which strips docstrings while preserving comments. Random --n-samples selection includes only HumanEval tasks whose tests expose expected outputs (inputs_results shape). Tasks that compare against a reference function (inputs_ref_func) are skipped even when selected explicitly via --task-id.
- scripts/optimize_humaneval_dspy_direct.py and scripts/optimize_humaneval_dspy_encdec.py run MIPRO optimization for the direct and encoder-decoder HumanEval programs.
- scripts/optimize_humaneval_dspy_direct_gepa.py and scripts/optimize_humaneval_dspy_encdec_gepa.py run GEPA optimization for the same program families.
- src/nl_code/optim/humaneval_dspy_eval.py contains the reusable evaluation loop, generation config, per-attempt results, and summary models.
- src/nl_code/optim/dspy_generators.py defines the direct generator and the encoder-decoder generator used by the eval.
- src/nl_code/optim/humaneval_dspy_optimize.py and src/nl_code/optim/humaneval_dspy_gepa.py contain reusable optimizer orchestration, split handling, artifact writing, and summary models. Optimization event logging uses a per-context logger; dspy.configure(lm=...) remains process-global, so run one optimization or eval job per process.
- src/nl_code/optim/humaneval_dspy_logs.py parses eval logs into a nested Pydantic snapshot for notebook analysis. It preserves run stats, per-attempt results, and individual LM calls, including both encoder and decoder calls for new encoder-decoder runs.
- scripts/parse_humaneval_dspy_logs.py is a thin wrapper that parses the current logs/ directory into a snapshot JSON.
- nbs/exp/human_eval_dspy.py is a marimo notebook for inspecting the workflow, loading the parsed snapshot, comparing pass rates, and stepping through failed cases side by side for direct and encoder-decoder generations.
- scripts/sample_humaneval_dspy_splits.py samples train/dev/eval task splits from the full direct and encoder-decoder eval logs.
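The eligibility rule for random sampling described above can be sketched as a filter followed by a seeded draw. This is an illustration only: select_tasks, the seed parameter, and the shape assigned to each task ID below are made up for the example and are not taken from the script.

```python
import random


def select_tasks(test_shapes: dict[str, str], n_samples: int, seed: int = 0) -> list[str]:
    """Hypothetical helper: sample only from tasks whose tests carry expected outputs."""
    eligible = sorted(
        task_id for task_id, shape in test_shapes.items() if shape == "inputs_results"
    )
    # Never ask for more tasks than are eligible.
    return random.Random(seed).sample(eligible, min(n_samples, len(eligible)))


# Made-up shape assignments for three task IDs.
shapes = {
    "HumanEval/0": "inputs_results",
    "HumanEval/1": "inputs_ref_func",
    "HumanEval/2": "inputs_results",
}
picked = select_tasks(shapes, n_samples=5)
print(sorted(picked))  # ['HumanEval/0', 'HumanEval/2']
```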
Typical usage:
OPENROUTER_API_KEY=... uv run python scripts/humaneval_dspy_eval.py --generation-type both --n-samples 20
uv run python scripts/parse_humaneval_dspy_logs.py --logs-dir logs --output-path logs/human_eval_dspy_snapshot_latest.json
uv run marimo edit nbs/exp/human_eval_dspy.py
DSPy Log And Report Inspection
Forensic tooling works in layers. Flat logs/ output from eval and optimization
is not a session root on its own.
logs/ ──parse_humaneval_dspy_logs.py──► one aggregate snapshot JSON
logs/ ──sessionize_dspy_logs_v0.py────► sessionized corpus (metadata.json + raw/)
sessionized corpus ──inspect_dspy_* --walk──► parsed_*_reports/
parsed_gepa_reports/ ──build_dspy_gepa_agent_bundle.py──► agent bundle JSON
Use parse_humaneval_dspy_logs.py for quick notebook-style exploration across
all files in logs/. Use sessionize_dspy_logs_v0.py before
inspect_dspy_eval_session.py or inspect_dspy_gepa_session.py. Those inspect
scripts require a session directory containing metadata.json; pointing them at
raw subdirectories such as logs/eval_full_5x/baseline_direct will fail.
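A quick pre-flight check can tell a session directory apart from a raw subdirectory. The sketch below relies only on the metadata.json rule stated above; is_session_dir, find_session_dirs, and the session_000001 directory name are hypothetical, and the way --walk actually discovers sessions may differ.

```python
import tempfile
from pathlib import Path


def is_session_dir(path: Path) -> bool:
    """A session directory is one that directly contains metadata.json."""
    return path.is_dir() and (path / "metadata.json").is_file()


def find_session_dirs(root: Path) -> list[Path]:
    """Hypothetical corpus walk: every directory holding a metadata.json file."""
    return sorted(p.parent for p in root.rglob("metadata.json") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    session = root / "session_000001"
    (session / "raw").mkdir(parents=True)
    (session / "metadata.json").write_text("{}")

    session_ok = is_session_dir(session)
    raw_ok = is_session_dir(session / "raw")
    found = [p.name for p in find_session_dirs(root)]

print(session_ok, raw_ok, found)  # True False ['session_000001']
```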
The canonical sessionized corpus lives outside the repo at
~/drotherm/data/code-comp/dspy-exps/v0. Regenerate it from the repo root:
SESSIONIZE_SOURCE_ROOT=$PWD \
SESSIONIZE_OUTPUT_ROOT=~/drotherm/data/code-comp/dspy-exps/v0 \
uv run python scripts/sessionize_dspy_logs_v0.py
uv run python scripts/inspect_dspy_eval_session.py \
~/drotherm/data/code-comp/dspy-exps/v0 --walk
uv run python scripts/inspect_dspy_gepa_session.py \
~/drotherm/data/code-comp/dspy-exps/v0 --walk
uv run python scripts/build_dspy_gepa_agent_bundle.py \
~/drotherm/data/code-comp/dspy-exps/v0/parsed_gepa_reports
Scripts and docs:
- scripts/sessionize_dspy_logs_v0.py groups raw DSPy log artifacts into session directories and writes session metadata.
- scripts/inspect_dspy_eval_session.py parses one eval session, or walks a corpus, into *.eval_report.json files with runs, samples, attempts, generation calls, aggregates, and parse notes.
- scripts/inspect_dspy_gepa_session.py parses one GEPA optimizer session, or walks a corpus, into *.gepa_report.json files with optimizer runs, programs, split/task scores, metric calls, generated outputs, optimizer iterations, and safe gepa_state.bin metadata scans.
- scripts/build_dspy_gepa_agent_bundle.py combines the per-session GEPA reports into one cross-session gepa_optimization_agent_bundle.json for downstream analysis agents or UI tooling. The bundle omits raw LLM request payloads; treat parsed forensic reports as sensitive if shared externally.
- docs/dspy-log-sessions-v0.md documents the sessionized log corpus and sessionization rules.
- docs/dspy-eval-optimizer-extraction-progress.md records extraction progress and the known limits of eval versus optimizer logs.
- docs/session_000018_gepa_prompt_variants.md is a concrete session-level prompt-variant review for the most complete direct GEPA trace.
The report extractors use Python's standard-library json module because these
artifacts can contain very large integers that are not safe with srsly's
ujson backend.
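The standard-library behaviour this relies on is easy to confirm: Python's json module round-trips integers of arbitrary size exactly. The sketch below does not exercise srsly at all, and the seed key is a hypothetical field name.

```python
import json

# An integer well beyond the 64-bit range.
big = 2**80 + 1

encoded = json.dumps({"seed": big})
restored = json.loads(encoded)["seed"]

print(restored == big)  # True
print(type(restored).__name__)  # int
```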
DSPy Static Viewer
ui/dspy-eval-static-viewer/ contains a self-contained static viewer generated
from the parsed eval and GEPA reports. Open
ui/dspy-eval-static-viewer/viewer.html directly in a browser; it loads
data/viewer_data.js locally and does not require a backend server.
The viewer includes:
- a GEPA prompt-flow tab with full prompt text, candidate lineage, scores, and per-task heatmaps;
- a HumanEval full-5x sample variation matrix with task drilldowns; and
- CSV exports for prompt nodes and stable/unstable task summaries.
The committed viewer is isolated from the existing ui/dataset-explorer app.
It intentionally includes only the browser-loadable data bundle and CSV exports,
not the duplicate JSON payload or one-off preprocessing script from the original
Desktop bundle.
Headless validation runs
General dataset validation/debugging commands that import matplotlib should run headlessly with:
MPLBACKEND=Agg uv run python ...
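The same effect can be had from inside a Python entry point, since matplotlib reads the MPLBACKEND environment variable when it selects a backend. This is a minimal sketch of that approach, not a description of how the package's own tooling sets the variable.

```python
import os

# Must run before the first `import matplotlib` in the process.
# setdefault keeps any value the caller already exported.
os.environ.setdefault("MPLBACKEND", "Agg")

backend = os.environ["MPLBACKEND"]
print(backend)
```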
Rebuild Dataset Caches
Run the Docker-backed cache rebuilds with:
uv run python -m nl_code.datasets.cache_cli rebuild all
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-plus
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-pro
uv run python -m nl_code.datasets.cache_cli rebuild mbpp-pro
uv run python -m nl_code.datasets.cache_cli rebuild class-eval
uv run python -m nl_code.datasets.cache_cli rebuild bigcodebench-lite-pro
cache_cli rebuild sets MPLBACKEND=Agg automatically.
Current observed results with the default execution image and env limits:
humaneval-plus: cached 163 tasks (163 raw, 1 flawed)
humaneval-pro: cached 163 tasks (163 raw, 1 flawed)
mbpp-pro: cached 375 tasks (375 raw, 3 flawed)
class-eval: cached 98 tasks (98 raw, 2 flawed)
bigcodebench-lite-pro: cached 54 tasks (54 raw, 3 flawed)
The remaining flawed samples above are dataset-level failures, not Docker runtime failures.
The current known flawed HumanEval-Pro sample is HumanEvalPro/24, where the
new function docstring is not present in new_solution.