nl-code

Primitives for research into LLMs and code generation. Provides dataset loading, code execution (with Docker isolation), code analysis, and a dataset explorer UI.

Install

uv add nl-code                  # core
uv add "nl-code[docker]"        # + Docker execution via dr-docker
uv add "nl-code[bigcodebench]"  # + scientific libs for BigCodeBench/ClassEval

Code Execution

Execute generated code in isolated Docker containers.

Three execution modes cover all supported dataset test formats:

  • function_call — call a named function with inputs, compare return values (HumanEval)
  • assertion — exec code + assertion-based test code (HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro)
  • unittest — exec code + unittest.TestCase classes (ClassEval)
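To make the difference between the three modes concrete, here is a minimal in-process sketch. It is not nl-code's API: the helper names (run_function_call, run_assertion, run_unittest) are hypothetical, and it calls exec directly in the current interpreter, with none of the Docker isolation the package provides.

```python
import io
import unittest


def run_function_call(code: str, func_name: str, args: tuple, expected) -> bool:
    """function_call mode: exec the code, call a named function, compare the result."""
    namespace: dict = {}
    exec(code, namespace)
    return namespace[func_name](*args) == expected


def run_assertion(code: str, test_code: str) -> bool:
    """assertion mode: exec the code, then exec assertion-based test code against it."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test_code, namespace)
    except AssertionError:
        # Only failed assertions count as a test failure here; other errors propagate.
        return False
    return True


def run_unittest(code: str, test_code: str) -> bool:
    """unittest mode: exec the code and test classes, then run every TestCase found."""
    namespace: dict = {}
    exec(code, namespace)
    exec(test_code, namespace)
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for obj in list(namespace.values()):
        is_case = isinstance(obj, type) and issubclass(obj, unittest.TestCase)
        if is_case and obj is not unittest.TestCase:
            suite.addTests(loader.loadTestsFromTestCase(obj))
    # Send runner output to a buffer so only the boolean verdict is reported.
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.wasSuccessful()


# Toy inputs standing in for a generated solution and its tests.
SOLUTION = "def add(a, b):\n    return a + b\n"
ASSERT_TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
UNIT_TESTS = (
    "import unittest\n"
    "class TestAdd(unittest.TestCase):\n"
    "    def test_add(self):\n"
    "        self.assertEqual(add(2, 3), 5)\n"
)

print(run_function_call(SOLUTION, "add", (2, 3), 5))
print(run_assertion(SOLUTION, ASSERT_TESTS))
print(run_unittest(SOLUTION, UNIT_TESTS))
```

Running untrusted generated code this way is unsafe, which is the reason the package executes it in containers instead.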

Batch variants (batch_run_test_cases, batch_run_assertion_tests, batch_run_unittest_tests) process many code samples in a single container with auto-chunking.
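The idea behind auto-chunking can be sketched as follows. This is an illustration only: the chunk size, the helper names (chunked, batch_run), and the per-sample callback are assumptions, and the real batch functions may split and dispatch work differently.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def batch_run(samples: Sequence[T], run_one: Callable[[T], R], chunk_size: int = 50) -> List[R]:
    """Process samples chunk by chunk, preserving input order in the results."""
    results: List[R] = []
    for chunk in chunked(samples, chunk_size):
        # Assumption: a real runner would hand the whole chunk to one container
        # here; this sketch simply evaluates each sample in-process.
        results.extend(run_one(sample) for sample in chunk)
    return results


print([list(c) for c in chunked([1, 2, 3, 4, 5], 2)])  # [[1, 2], [3, 4], [5]]
```

Chunking bounds how much work is handed over at once while still amortizing container startup over many samples.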

Build The Docker Image

Build the execution image from the repo root:

docker build -t nl-code/code-eval-scientific:v1 -f docker/scientific.Dockerfile .

This is the default runtime image used by the execution pipeline. The Dockerfile installs both the bigcodebench dependency set and the pinned dr-docker runtime dependency directly from pyproject.toml, so the image stays aligned with the repo's declared execution requirements.

Run The Docker Test Tier

Docker-dependent tests are marked with @pytest.mark.docker and are excluded from the default pytest run.

Run them explicitly with:

uv run nl-code-test docker

Extra pytest arguments can be passed after docker, for example:

uv run nl-code-test docker -q tests/test_execution_runner.py

Datasets

Loaders for HumanEval, HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro, and ClassEval. Datasets are fetched from HuggingFace, parsed into Task objects, and cached locally. DatasetSlice supports filtering, seeded shuffling, and limiting the number of tasks.
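The slicing behaviour described above (filter, seeded shuffle, limit) can be sketched in plain Python. ToyTask and take_slice are hypothetical stand-ins, not the package's Task or DatasetSlice, and the order in which the real class applies the three steps is an assumption.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class ToyTask:
    """Hypothetical stand-in for a parsed dataset task."""
    task_id: str
    prompt: str


def take_slice(
    tasks: Sequence[ToyTask],
    predicate: Optional[Callable[[ToyTask], bool]] = None,
    seed: Optional[int] = None,
    limit: Optional[int] = None,
) -> List[ToyTask]:
    """Filter, then shuffle reproducibly, then truncate."""
    selected = [t for t in tasks if predicate is None or predicate(t)]
    if seed is not None:
        # A private Random instance keeps the shuffle reproducible and
        # leaves the global random state untouched.
        random.Random(seed).shuffle(selected)
    return selected if limit is None else selected[:limit]


tasks = [ToyTask(f"task/{i}", f"prompt {i}") for i in range(10)]
even_only = lambda t: int(t.task_id.split("/")[1]) % 2 == 0
subset = take_slice(tasks, predicate=even_only, seed=0, limit=3)
print([t.task_id for t in subset])
```

Using the same seed yields the same subset on every run, which keeps experiments on a sampled slice comparable.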

Dataset Explorer

A FastAPI + React app for browsing and comparing datasets. Run it from ui/dataset-explorer/.

Headless Validation Runs

Dataset validation and debugging commands that import matplotlib should run headlessly, with the non-interactive Agg backend selected through MPLBACKEND:

MPLBACKEND=Agg uv run python ...
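When such a command is launched from Python rather than from the shell, the same effect can be had by setting the variable in the child process's environment. A minimal sketch, where run_headless is a hypothetical helper and the child command merely echoes the variable instead of importing matplotlib:

```python
import os
import subprocess
import sys
from typing import List


def run_headless(args: List[str]) -> subprocess.CompletedProcess:
    """Run a Python child process with MPLBACKEND=Agg set in its environment."""
    env = {**os.environ, "MPLBACKEND": "Agg"}
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )


# The child only reports the variable it received; a real validation script
# would import matplotlib, which reads MPLBACKEND when it picks a backend.
result = run_headless(["-c", "import os; print(os.environ['MPLBACKEND'])"])
print(result.stdout.strip())  # Agg
```

The parent process's own environment is left unchanged, so only the launched command runs headlessly.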

Rebuild Dataset Caches

Run the Docker-backed cache rebuilds with:

uv run python -m nl_code.datasets.cache_cli rebuild humaneval-plus
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-pro
uv run python -m nl_code.datasets.cache_cli rebuild mbpp-pro
uv run python -m nl_code.datasets.cache_cli rebuild class-eval
uv run python -m nl_code.datasets.cache_cli rebuild bigcodebench-lite-pro

cache_cli rebuild sets MPLBACKEND=Agg automatically.

Results currently observed with the default execution image and env limits:

humaneval-plus: cached 163 tasks (163 raw, 1 flawed)
humaneval-pro: cached 164 tasks (164 raw, 0 flawed)
mbpp-pro: cached 375 tasks (375 raw, 3 flawed)
class-eval: cached 98 tasks (98 raw, 2 flawed)
bigcodebench-lite-pro: cached 54 tasks (54 raw, 3 flawed)

The remaining flawed samples above are dataset-level failures, not Docker runtime failures.
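For tracking these numbers across rebuilds, the summary lines can be parsed into structured records. The sketch below assumes the CLI prints lines in exactly the format shown above; the regular expression and the parse_summary helper are illustrative and not part of nl-code.

```python
import re
from typing import Dict, Union

SUMMARY_RE = re.compile(
    r"^(?P<name>[\w-]+): cached (?P<cached>\d+) tasks "
    r"\((?P<raw>\d+) raw, (?P<flawed>\d+) flawed\)$"
)


def parse_summary(line: str) -> Dict[str, Union[str, int]]:
    """Parse one 'name: cached N tasks (R raw, F flawed)' line into a dict."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    return {
        "name": match["name"],
        "cached": int(match["cached"]),
        "raw": int(match["raw"]),
        "flawed": int(match["flawed"]),
    }


# The lines reported above, copied verbatim.
REPORTED = """\
humaneval-plus: cached 163 tasks (163 raw, 1 flawed)
humaneval-pro: cached 164 tasks (164 raw, 0 flawed)
mbpp-pro: cached 375 tasks (375 raw, 3 flawed)
class-eval: cached 98 tasks (98 raw, 2 flawed)
bigcodebench-lite-pro: cached 54 tasks (54 raw, 3 flawed)
"""

records = [parse_summary(line) for line in REPORTED.splitlines()]
print(sum(r["cached"] for r in records))  # 854
print(sum(r["flawed"] for r in records))  # 9
```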
