nl-code
Primitives for research into LLMs and code generation. Provides dataset loading, code execution (with Docker isolation), code analysis, and a dataset explorer UI.
Install
uv add nl-code # core
uv add "nl-code[docker]" # + Docker execution via dr-docker
uv add "nl-code[bigcodebench]" # + scientific libs for BigCodeBench/ClassEval
Code Execution
Execute generated code in isolated Docker containers.
Three execution modes cover all supported dataset test formats:
- function_call — call a named function with inputs, compare return values (HumanEval)
- assertion — exec code + assertion-based test code (HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro)
- unittest — exec code + unittest.TestCase classes (ClassEval)
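The three modes can be illustrated with a minimal in-process sketch. This is not the nl-code API: the helper names (run_function_call, run_assertion, run_unittest) are hypothetical, and the code runs in the current interpreter with no isolation, whereas the package executes code inside Docker containers.

```python
import io
import unittest


def run_function_call(code: str, func_name: str, args: tuple, expected) -> bool:
    """function_call style: exec the code, call a named function, compare the return value."""
    namespace: dict = {}
    exec(code, namespace)
    return namespace[func_name](*args) == expected


def run_assertion(code: str, test_code: str) -> bool:
    """assertion style: exec the code, then exec assertion-based test code in the same namespace."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test_code, namespace)
    except AssertionError:
        return False
    return True


def run_unittest(code: str, test_code: str) -> bool:
    """unittest style: exec the code and test code, then run every TestCase subclass that was defined."""
    namespace: dict = {}
    exec(code, namespace)
    exec(test_code, namespace)
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for obj in list(namespace.values()):
        if (
            isinstance(obj, type)
            and issubclass(obj, unittest.TestCase)
            and obj is not unittest.TestCase
        ):
            suite.addTests(loader.loadTestsFromTestCase(obj))
    # Send runner output to an in-memory buffer so only the verdict is returned.
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    # Require at least one test so an empty suite does not count as a pass.
    return result.testsRun > 0 and result.wasSuccessful()


CODE = "def add(a, b):\n    return a + b\n"
UNITTEST_CODE = (
    "import unittest\n"
    "class TestAdd(unittest.TestCase):\n"
    "    def test_add(self):\n"
    "        self.assertEqual(add(2, 2), 4)\n"
)
```

With these definitions, run_function_call(CODE, "add", (2, 3), 5), run_assertion(CODE, "assert add(1, 1) == 2"), and run_unittest(CODE, UNITTEST_CODE) each exercise one mode against the same generated function.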
Batch variants (batch_run_test_cases, batch_run_assertion_tests, batch_run_unittest_tests) process many code samples in a single container with auto-chunking.
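As a rough illustration of chunking, the sketch below splits a list of code samples into fixed-size chunks and hands each chunk to a runner callback. The names chunked and run_in_chunks and the chunk_size parameter are assumptions made for this example; how the batch functions actually choose chunk sizes is not described here.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(samples: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def run_in_chunks(
    samples: Sequence[T],
    run_chunk: Callable[[Sequence[T]], List[R]],
    chunk_size: int,
) -> List[R]:
    """Call run_chunk once per chunk and concatenate results in input order."""
    results: List[R] = []
    for chunk in chunked(samples, chunk_size):
        results.extend(run_chunk(chunk))
    return results
```

Here run_chunk stands in for "start one container and evaluate these samples"; keeping the results in input order lets callers match each result to its sample by index.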
Build The Docker Image
Build the execution image from the repo root:
docker build -t nl-code/code-eval-scientific:v1 -f docker/scientific.Dockerfile .
This is the default runtime image used by the execution pipeline. The Dockerfile
installs both the bigcodebench dependency set and the pinned dr-docker
runtime dependency directly from pyproject.toml, so the image stays aligned
with the repo's declared execution requirements.
Run The Docker Test Tier
Docker-dependent tests are marked with @pytest.mark.docker and are excluded
from the default pytest run.
Run them explicitly with:
uv run nl-code-test docker
You can pass extra pytest arguments through after docker, for example:
uv run nl-code-test docker -q tests/test_execution_runner.py
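One common way to keep a marked test tier out of the default run is a pair of conftest.py hooks like the sketch below. This is an assumption about how such a setup could look, not a copy of this repo's configuration: the --run-docker flag is hypothetical, and the project may instead use a marker expression or the nl-code-test wrapper alone.

```python
# conftest.py (sketch): deselect tests marked "docker" unless explicitly requested.


def pytest_addoption(parser):
    # Hypothetical opt-in flag; not part of nl-code.
    parser.addoption(
        "--run-docker",
        action="store_true",
        default=False,
        help="include tests marked with @pytest.mark.docker",
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "docker: test requires a Docker daemon")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-docker"):
        return
    selected, deselected = [], []
    for item in items:
        if item.get_closest_marker("docker") is not None:
            deselected.append(item)
        else:
            selected.append(item)
    if deselected:
        # Report the dropped tests as deselected and keep the rest in place.
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected
```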
Datasets
Loaders for HumanEval, HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro, and ClassEval. Datasets are fetched from Hugging Face, parsed into Task objects, and cached locally. DatasetSlice supports filtering, seeded shuffling, and limiting the number of tasks.
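To make the slicing semantics concrete, here is a small stand-in that applies a filter, a seeded shuffle, and a limit in that order. SliceSketch and its field names are hypothetical and only mirror the behavior described above; the real DatasetSlice signature and its order of operations may differ.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class SliceSketch:
    """Hypothetical stand-in for a dataset slice: filter, then seeded shuffle, then limit."""

    predicate: Optional[Callable[[dict], bool]] = None
    shuffle_seed: Optional[int] = None
    limit: Optional[int] = None

    def apply(self, tasks: List[dict]) -> List[dict]:
        # Build a new list so the caller's task list is never mutated.
        selected = [t for t in tasks if self.predicate is None or self.predicate(t)]
        if self.shuffle_seed is not None:
            # A private Random instance makes the order reproducible for a given seed.
            random.Random(self.shuffle_seed).shuffle(selected)
        if self.limit is not None:
            selected = selected[: self.limit]
        return selected


tasks = [{"task_id": i} for i in range(10)]
even_sample = SliceSketch(
    predicate=lambda t: t["task_id"] % 2 == 0, shuffle_seed=7, limit=3
)
subset = even_sample.apply(tasks)
```

Applying the same slice twice yields the same subset, which is the point of seeding the shuffle: a limited sample stays stable across runs.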
Dataset Explorer
A FastAPI + React app for browsing and comparing datasets. Run it from ui/dataset-explorer/.
Headless Validation Runs
Dataset validation and debugging commands that import matplotlib should run headlessly:
MPLBACKEND=Agg uv run python ...
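When such a command is launched from Python rather than a shell, the same effect can be had by setting the variable in the child's environment. The helper below is a sketch; headless_env is a hypothetical name, and the variable has to be set before matplotlib is imported in the process that uses it.

```python
import os
from typing import Mapping, Optional


def headless_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with matplotlib forced to the Agg backend."""
    env = dict(os.environ if base is None else base)
    env["MPLBACKEND"] = "Agg"
    return env


# Usage sketch (the script path is a placeholder):
#   import subprocess, sys
#   subprocess.run([sys.executable, "path/to/script.py"], env=headless_env(), check=True)
```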
Rebuild Dataset Caches
Run the Docker-backed cache rebuilds with:
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-plus
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-pro
uv run python -m nl_code.datasets.cache_cli rebuild mbpp-pro
uv run python -m nl_code.datasets.cache_cli rebuild class-eval
uv run python -m nl_code.datasets.cache_cli rebuild bigcodebench-lite-pro
cache_cli rebuild sets MPLBACKEND=Agg automatically.
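To rebuild every cache in one go, the commands above can be wrapped in a short script. This is a convenience sketch, not something shipped with the package; it assumes the CLI exits with a non-zero status on failure, which is not confirmed here.

```python
import subprocess
from typing import List

# Dataset names taken from the rebuild commands listed above.
DATASETS = [
    "humaneval-plus",
    "humaneval-pro",
    "mbpp-pro",
    "class-eval",
    "bigcodebench-lite-pro",
]


def rebuild_command(name: str) -> List[str]:
    """Build the argv for one cache rebuild."""
    return ["uv", "run", "python", "-m", "nl_code.datasets.cache_cli", "rebuild", name]


def rebuild_all(dry_run: bool = True) -> List[str]:
    """Run each rebuild in turn and return the names whose command failed.

    With dry_run=True the commands are only printed, never executed.
    """
    failed = []
    for name in DATASETS:
        cmd = rebuild_command(name)
        if dry_run:
            print(" ".join(cmd))
            continue
        # Assumption: a non-zero exit status signals a failed rebuild.
        if subprocess.run(cmd).returncode != 0:
            failed.append(name)
    return failed
```

Call rebuild_all(dry_run=False) from the repo root to actually execute the rebuilds; the default only prints the commands.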
Current observed results with the default execution image and env limits:
humaneval-plus: cached 163 tasks (163 raw, 1 flawed)
humaneval-pro: cached 164 tasks (164 raw, 0 flawed)
mbpp-pro: cached 375 tasks (375 raw, 3 flawed)
class-eval: cached 98 tasks (98 raw, 2 flawed)
bigcodebench-lite-pro: cached 54 tasks (54 raw, 3 flawed)
The remaining flawed samples above are dataset-level failures, not Docker runtime failures.
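If these summary lines need to be tracked over time, they can be parsed into records. The sketch below assumes the exact line format shown above; parse_summary is a hypothetical helper, not part of nl-code.

```python
import re
from typing import Dict, Optional

# Matches lines such as "mbpp-pro: cached 375 tasks (375 raw, 3 flawed)".
SUMMARY_RE = re.compile(
    r"^(?P<name>[\w-]+): cached (?P<cached>\d+) tasks "
    r"\((?P<raw>\d+) raw, (?P<flawed>\d+) flawed\)$"
)


def parse_summary(line: str) -> Optional[Dict[str, object]]:
    """Parse one summary line into a dict, or return None if it does not match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "cached": int(match.group("cached")),
        "raw": int(match.group("raw")),
        "flawed": int(match.group("flawed")),
    }
```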