Primitives for research into LLMs and code

nl-code

Primitives for research into LLMs and code generation. Provides dataset loading, code execution (with Docker isolation), code analysis, and a dataset explorer UI.

Install

uv add nl-code                  # core
uv add "nl-code[docker]"        # + Docker execution via dr-docker
uv add "nl-code[bigcodebench]"  # + scientific libs for BigCodeBench/ClassEval

Code Execution

Execute generated code in isolated Docker containers.

Three execution modes cover all supported dataset test formats:

  • function_call — call a named function with inputs, compare return values (HumanEval)
  • assertion — exec code + assertion-based test code (HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro)
  • unittest — exec code + unittest.TestCase classes (ClassEval)

Batch variants (batch_run_test_cases, batch_run_assertion_tests, batch_run_unittest_tests) process many code samples in a single container with auto-chunking.
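As a rough illustration of what chunking a batch involves, the sketch below splits a list of code samples into fixed-size chunks and runs each chunk through a caller-supplied function. The names chunked and run_in_chunks are hypothetical, and how nl-code actually picks chunk sizes is not stated here, so the fixed chunk_size is an assumption.

```python
from collections.abc import Callable, Iterator, Sequence


def chunked(samples: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most chunk_size samples."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def run_in_chunks(
    samples: Sequence[str],
    chunk_size: int,
    run_chunk: Callable[[Sequence[str]], list[bool]],
) -> list[bool]:
    """Run each chunk through run_chunk and return results in input order."""
    results: list[bool] = []
    for chunk in chunked(samples, chunk_size):
        results.extend(run_chunk(chunk))
    return results
```

Keeping results in input order means each outcome can be matched back to its sample by index, regardless of how the batch was split.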

Build The Docker Image

Build the execution image from the repo root:

docker build -t nl-code/code-eval-scientific:v1 -f docker/scientific.Dockerfile .

This is the default runtime image used by the execution pipeline. The Dockerfile installs both the bigcodebench dependency set and the pinned dr-docker runtime dependency directly from pyproject.toml, so the image stays aligned with the repo's declared execution requirements.

Run The Docker Test Tier

Docker-dependent tests are marked with @pytest.mark.docker and are excluded from the default pytest run.

Run them explicitly with:

uv run nl-code-test docker

You can pass extra pytest arguments through after docker, for example:

uv run nl-code-test docker -q tests/test_execution_runner.py

Datasets

Loaders for HumanEval, HumanEval-Pro, MBPP-Pro, BigCodeBench Lite Pro, and ClassEval. Datasets are fetched from HuggingFace, parsed into Task objects, and cached locally.
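The fetch, parse, and cache flow described above can be sketched as follows. This is an illustration under stated assumptions, not the package's loader: the Task fields, the JSON cache format, and the load_tasks and fetch_rows names are all hypothetical, and the HuggingFace download is replaced by a caller-supplied function.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Task:
    # Hypothetical fields; the real Task model is richer.
    task_id: str
    prompt: str


def load_tasks(cache_path: Path, fetch_rows: Callable[[], list[dict]]) -> list[Task]:
    """Return parsed tasks, fetching raw rows only when no local cache exists."""
    if cache_path.exists():
        rows = json.loads(cache_path.read_text(encoding="utf-8"))
    else:
        rows = fetch_rows()  # stand-in for the remote download
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(rows), encoding="utf-8")
    return [Task(task_id=row["task_id"], prompt=row["prompt"]) for row in rows]
```

In this sketch a second call with the same cache_path reads from disk and never invokes fetch_rows.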

The corresponding raw task models preserve the original dataset inputs as source__... fields and expose richer derived artifacts such as:

  • official prompt fields
  • stripped and comment-preserving code stubs
  • stripped and comment-preserving ground-truth code

Across task families, new_official_prompt, new_code_stub, and new_code_stub_with_comments provide a consistent interface for prompt/stub access even when the underlying dataset-specific field names differ.
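One way to picture this shared interface is a structural type that any task family can satisfy. The sketch below is an assumption-laden illustration: only the three attribute names come from the text above, while the str types, the HasPromptAndStub protocol, the ExampleFunctionTask class, and the prompt_with_stub helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


class HasPromptAndStub(Protocol):
    # Attribute names from the README; the str types are an assumption.
    new_official_prompt: str
    new_code_stub: str
    new_code_stub_with_comments: str


@dataclass(frozen=True)
class ExampleFunctionTask:
    # Hypothetical task family used only for this illustration.
    new_official_prompt: str
    new_code_stub: str
    new_code_stub_with_comments: str


def prompt_with_stub(task: HasPromptAndStub, *, keep_comments: bool = False) -> str:
    """Join the prompt with the stripped or comment-preserving stub."""
    stub = task.new_code_stub_with_comments if keep_comments else task.new_code_stub
    return f"{task.new_official_prompt}\n\n{stub}"
```

Code written against the protocol does not need to know which dataset a task came from.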

DatasetSlice supports filtering, seeded shuffling, limits, and parallel accessors for common raw-task artifacts:

  • get_source_code(task_id)
  • get_official_prompt(task_id)
  • get_code_stub(task_id)
  • get_code_stub_with_comments(task_id)
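Filtering, seeded shuffling, and limits compose in a predictable order. The sketch below shows one plausible ordering over plain task ids. The slice_task_ids function and its parameters are hypothetical and do not reflect the DatasetSlice signature.

```python
import random
from collections.abc import Callable, Sequence
from typing import Optional


def slice_task_ids(
    task_ids: Sequence[str],
    *,
    keep: Optional[Callable[[str], bool]] = None,
    seed: Optional[int] = None,
    limit: Optional[int] = None,
) -> list[str]:
    """Filter first, then shuffle with a private seeded RNG, then truncate."""
    selected = [task_id for task_id in task_ids if keep is None or keep(task_id)]
    if seed is not None:
        # A dedicated Random instance keeps the global RNG state untouched.
        random.Random(seed).shuffle(selected)
    if limit is not None:
        selected = selected[:limit]
    return selected
```

Because the shuffle uses its own seeded generator, the same inputs and seed give the same subset on every run, which is what makes a sampled evaluation reproducible.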

Dataset Explorer

A FastAPI + React app for browsing and comparing datasets. Run from ui/dataset-explorer/.

Headless Validation Runs

Dataset validation or debugging commands that import matplotlib should run headlessly by setting MPLBACKEND=Agg:

MPLBACKEND=Agg uv run python ...
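When such a command is launched from Python rather than from a shell, the same variable can be passed through the child process environment. The following is a minimal sketch, and the headless_env and run_headless helpers are hypothetical names, not part of nl-code.

```python
import os
import subprocess
from collections.abc import Mapping
from typing import Optional


def headless_env(base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Copy an environment and force matplotlib's non-interactive Agg backend."""
    env = dict(os.environ if base is None else base)
    env["MPLBACKEND"] = "Agg"
    return env


def run_headless(args: list[str]) -> subprocess.CompletedProcess:
    """Run a command with MPLBACKEND=Agg set in the child environment only."""
    return subprocess.run(
        args, env=headless_env(), capture_output=True, text=True, check=False
    )
```

Setting the variable only for the child process leaves the parent process environment unchanged.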

Rebuild Dataset Caches

Run the Docker-backed cache rebuilds with:

uv run python -m nl_code.datasets.cache_cli rebuild all
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-plus
uv run python -m nl_code.datasets.cache_cli rebuild humaneval-pro
uv run python -m nl_code.datasets.cache_cli rebuild mbpp-pro
uv run python -m nl_code.datasets.cache_cli rebuild class-eval
uv run python -m nl_code.datasets.cache_cli rebuild bigcodebench-lite-pro

cache_cli rebuild sets MPLBACKEND=Agg automatically.

Current observed results with the default execution image and env limits:

humaneval-plus: cached 163 tasks (164 raw, 1 flawed)
humaneval-pro: cached 163 tasks (164 raw, 1 flawed)
mbpp-pro: cached 375 tasks (378 raw, 3 flawed)
class-eval: cached 98 tasks (100 raw, 2 flawed)
bigcodebench-lite-pro: cached 54 tasks (57 raw, 3 flawed)

The remaining flawed samples above are dataset-level failures, not Docker runtime failures.

The current known flawed HumanEval-Pro sample is HumanEvalPro/24, where the new function docstring is not present in new_solution.
