Primitives for research into LLMs and code

nl-code

Primitives for research into LLMs and code generation. Provides dataset loading, code execution (with Docker isolation), code analysis, and a dataset explorer UI.

Install

uv add nl-code                  # core
uv add "nl-code[docker]"        # + Docker execution via dr-docker
uv add "nl-code[bigcodebench]"  # + scientific libs for BigCodeBench/ClassEval

Code Execution

Execute generated code in isolated Docker containers.

Three execution modes cover all supported dataset test formats:

function_call — call a named function with inputs, compare return values (HumanEval)
assertion — exec code + assertion-based test code (HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro)
unittest — exec code + unittest.TestCase classes (ClassEval)

Batch variants (batch_run_test_cases, batch_run_assertion_tests, batch_run_unittest_tests) process many code samples in a single container with auto-chunking.
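
As a rough picture of auto-chunking, the sketch below splits a list of code samples into fixed-size groups, each of which could be sent to one container run. The helper name `chunked` and the fixed `chunk_size` are assumptions for illustration; how nl-code actually picks chunk boundaries is not described here.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


samples = ["a", "b", "c", "d", "e"]
print(list(chunked(samples, 2)))  # [['a', 'b'], ['c', 'd'], ['e']]
```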

Build The Docker Image

Build the execution image from the repo root:

docker build -t nl-code/code-eval-scientific:v1 -f docker/scientific.Dockerfile .

This is the default runtime image used by the execution pipeline. The Dockerfile installs both the bigcodebench dependency set and the pinned dr-docker runtime dependency directly from pyproject.toml, so the image stays aligned with the repo's declared execution requirements.

Run The Docker Test Tier

Docker-dependent tests are marked with @pytest.mark.docker and are excluded from the default pytest run.

Run them explicitly with:

uv run nl-code-test docker

You can pass extra pytest arguments through after docker, for example:

uv run nl-code-test docker -q tests/test_execution_runner.py

Datasets

Loaders for HumanEval, HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro, and ClassEval. Datasets are fetched from HuggingFace, parsed into Task objects, and cached locally.
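
The fetch, parse, and cache flow can be sketched with the standard library alone. Everything named here is hypothetical: the two-field `Task`, the `load_tasks` helper, and the `fetch_rows` callable that stands in for the HuggingFace download. The real Task model and cache format may differ.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Iterable


@dataclass(frozen=True)
class Task:
    # Hypothetical minimal task; the real Task model carries more fields.
    task_id: str
    prompt: str


def load_tasks(cache_path: Path, fetch_rows: Callable[[], Iterable[dict]]) -> list[Task]:
    """Return cached tasks if present; otherwise fetch, parse, and cache them."""
    if cache_path.exists():
        rows = json.loads(cache_path.read_text(encoding="utf-8"))
        return [Task(**row) for row in rows]
    # Parse only the fields the Task needs and ignore any extras.
    tasks = [Task(task_id=row["task_id"], prompt=row["prompt"]) for row in fetch_rows()]
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps([asdict(t) for t in tasks]), encoding="utf-8")
    return tasks
```

A second call with the same `cache_path` reads the JSON file and never invokes `fetch_rows`.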

The corresponding raw task models preserve the original dataset inputs as source__... fields and expose richer derived artifacts such as:

official prompt fields
stripped and comment-preserving code stubs
stripped and comment-preserving ground-truth code

Across task families, new_official_prompt, new_code_stub, and new_code_stub_with_comments provide a consistent interface for prompt/stub access even when the underlying dataset-specific field names differ.
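
One way to picture this shared interface is a pair of raw task classes whose source fields differ but which expose the same property. The class names and the `source__...` field names below are invented for illustration; only `new_official_prompt` comes from the description above.

```python
from dataclasses import dataclass
from typing import Protocol


class HasPrompt(Protocol):
    @property
    def new_official_prompt(self) -> str: ...


@dataclass(frozen=True)
class RawTaskA:
    source__new_problem: str  # hypothetical dataset-specific field name

    @property
    def new_official_prompt(self) -> str:
        return self.source__new_problem


@dataclass(frozen=True)
class RawTaskB:
    source__instruct_prompt: str  # hypothetical dataset-specific field name

    @property
    def new_official_prompt(self) -> str:
        return self.source__instruct_prompt


def prompts(tasks: list[HasPrompt]) -> list[str]:
    """Read prompts without knowing which dataset each task came from."""
    return [task.new_official_prompt for task in tasks]
```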

DatasetSlice supports filtering, seeded shuffling, limits, and parallel accessors for common raw-task artifacts:

get_source_code(task_id)
get_official_prompt(task_id)
get_code_stub(task_id)
get_code_stub_with_comments(task_id)
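
Filtering, seeded shuffling, and limits compose in a fixed order, which a small stand-alone function can illustrate. `slice_task_ids` and its parameters are hypothetical and do not reproduce the DatasetSlice signature.

```python
import random
from typing import Callable, Iterable, Optional


def slice_task_ids(
    task_ids: Iterable[str],
    keep: Optional[Callable[[str], bool]] = None,
    seed: Optional[int] = None,
    limit: Optional[int] = None,
) -> list[str]:
    """Filter first, then shuffle with a private seeded RNG, then truncate."""
    ids = [tid for tid in task_ids if keep is None or keep(tid)]
    if seed is not None:
        # A dedicated Random instance keeps the global RNG state untouched.
        random.Random(seed).shuffle(ids)
    return ids if limit is None else ids[:limit]
```

Because the shuffle uses its own seeded generator, the same seed over the same input always yields the same subset, which keeps sampled evaluation runs reproducible.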

Dataset Explorer

A FastAPI + React app for browsing and comparing datasets. Run it from ui/dataset-explorer/.

Headless validation runs

General dataset validation/debugging commands that import matplotlib should run headlessly with:

MPLBACKEND=Agg uv run python ...
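
When prefixing the command is inconvenient, for example inside a script, the same variable can be set from Python before matplotlib is first imported. The helper name below is hypothetical; it is a sketch, not part of nl-code.

```python
import os


def default_to_headless_backend() -> str:
    """Set MPLBACKEND=Agg unless the caller already chose a backend."""
    # Must run before the first `import matplotlib`, which reads this variable.
    os.environ.setdefault("MPLBACKEND", "Agg")
    return os.environ["MPLBACKEND"]


default_to_headless_backend()
```

`setdefault` leaves an explicitly chosen backend in place, so a value passed on the command line still wins.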

Rebuild Dataset Caches

Run the Docker-backed cache rebuilds with:

uv run python -m nl_code.datasets.cache_cli rebuild all
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-plus
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-pro
uv run python -m nl_code.datasets.cache_cli rebuild mbpp-pro
uv run python -m nl_code.datasets.cache_cli rebuild class-eval
uv run python -m nl_code.datasets.cache_cli rebuild bigcodebench-lite-pro

cache_cli rebuild sets MPLBACKEND=Agg automatically.

Current observed results with the default execution image and env limits:

humaneval-plus: cached 163 tasks (163 raw, 1 flawed)
humaneval-pro: cached 163 tasks (163 raw, 1 flawed)
mbpp-pro: cached 375 tasks (375 raw, 3 flawed)
class-eval: cached 98 tasks (98 raw, 2 flawed)
bigcodebench-lite-pro: cached 54 tasks (54 raw, 3 flawed)

The remaining flawed samples above are dataset-level failures, not Docker runtime failures.

The current known flawed HumanEval-Pro sample is HumanEvalPro/24, where the new function docstring is not present in new_solution.
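
To total the per-dataset figures listed above, a short script can parse the summary lines. The regular expression assumes the exact line format shown in the results list; `tally` is a hypothetical helper.

```python
import re

SUMMARY = """\
humaneval-plus: cached 163 tasks (163 raw, 1 flawed)
humaneval-pro: cached 163 tasks (163 raw, 1 flawed)
mbpp-pro: cached 375 tasks (375 raw, 3 flawed)
class-eval: cached 98 tasks (98 raw, 2 flawed)
bigcodebench-lite-pro: cached 54 tasks (54 raw, 3 flawed)
"""

PATTERN = re.compile(
    r"^(?P<name>[\w-]+): cached (?P<cached>\d+) tasks "
    r"\((?P<raw>\d+) raw, (?P<flawed>\d+) flawed\)$"
)


def tally(text: str) -> dict[str, int]:
    """Sum the cached, raw, and flawed counts over all matching lines."""
    rows = [m.groupdict() for line in text.splitlines() if (m := PATTERN.match(line))]
    return {key: sum(int(row[key]) for row in rows) for key in ("cached", "raw", "flawed")}


print(tally(SUMMARY))  # {'cached': 853, 'raw': 853, 'flawed': 10}
```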

Project details

Release history

0.7.0 (Jun 9, 2026)
0.6.0 (Apr 17, 2026), this version
0.5.0 (Apr 14, 2026)
0.4.1 (Apr 13, 2026)
0.4.0 (Apr 13, 2026)
0.2.0 (Apr 13, 2026)
0.1.0 (Apr 13, 2026)

Download files

Source Distribution

nl_code-0.6.0.tar.gz (46.8 MB), uploaded Apr 17, 2026, Source

Built Distribution

nl_code-0.6.0-py3-none-any.whl (51.8 kB), uploaded Apr 17, 2026, Python 3

File details

Details for the file nl_code-0.6.0.tar.gz.

File metadata

Download URL: nl_code-0.6.0.tar.gz
Upload date: Apr 17, 2026
Size: 46.8 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.0

File hashes

Hashes for nl_code-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`7f8ddda0f160516a0a715dbed53ce2c93903b8f6df9ee0fa0a09ad26dbc8d550`
MD5	`0e14031c9687e0a58f930dfb186629ae`
BLAKE2b-256	`405bc2a34b1f431f839075642da710e0933621f50e14a3a4a323af2ea60b3e7d`
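
A downloaded archive can be checked against the SHA256 digest above with `hashlib`. This is a minimal sketch: `sha256_of` is a hypothetical helper, and the commented-out comparison assumes the file sits in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


EXPECTED = "7f8ddda0f160516a0a715dbed53ce2c93903b8f6df9ee0fa0a09ad26dbc8d550"
# Assumed local path; uncomment after downloading the sdist:
# assert sha256_of(Path("nl_code-0.6.0.tar.gz")) == EXPECTED
```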

File details

Details for the file nl_code-0.6.0-py3-none-any.whl.

File metadata

Download URL: nl_code-0.6.0-py3-none-any.whl
Upload date: Apr 17, 2026
Size: 51.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.0

File hashes

Hashes for nl_code-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b7127e499eca8c987f54d88e0b90a0fa16ded40b26b647b94dedd9ac92ae92f6`
MD5	`2ccba70dd59b79a535940960f7371160`
BLAKE2b-256	`c24ccd00304cda55fca82fcce6c95c8e9a3e6135fbbd9c681154db6ea2425676`
