hep-data-llm
Introduction
This repo contains the code used to translate English-language requests for plots into the actual plots, using LLMs together with Python packages and tools such as ServiceX, Awkward, Vector, and hist. This is a proof of concept and is not meant to be production-level code.
Benchmark studies using the 8 adl-index benchmark questions, as presented at conferences, can be found in the results directory for various workflows.
Use --help at the root level and on commands (e.g. hep-data-llm plot --help) to get a complete list of options.
Installation
To run out of the box you'll need to do the following once:
Prerequisites:
- You'll need to have docker installed on your machine and, if you are on ARM, the multi-arch extensions (the source images are amd64 only):
  docker buildx create --name multiarch --driver docker-container --use
  docker run --privileged --rm tonistiigi/binfmt --install amd64
- Build the docker image that is used to run servicex, awkward, and friends. Which docker image is used depends on which workflow you are using.
  - ServiceX/Awkward:
    docker build -t hepdatallm-awkward:latest Docker
  - ServiceX/RDF:
    docker build -t hepdatallm-rdf:latest -f Docker/Dockerfile.RDF .
- If you are running a servicex workflow, get an access token. Make sure the servicex.yaml file is either in your home directory or your current working directory.
- You'll need token(s) to access the LLM. Here is what the .env looks like. Please create this either in your local directory or your home directory. Make sure only you can read it: this is access to a paid service!
api_openai_com_API_KEY=<openai-key>
api_together_xyz_API_KEY=<together.ai key>
openrouter_ai_API_KEY=<openrouter-key>
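Because the .env file holds keys to paid services, it can be worth checking its permissions before a run. The following is a minimal sketch and not part of hep-data-llm: the helper names is_private and read_env_keys are hypothetical, the parser only handles simple KEY=value lines, and the permission check assumes a POSIX file system.

```python
import stat
from pathlib import Path


def is_private(path: Path) -> bool:
    """Return True if no group or other permission bits are set (POSIX only)."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def read_env_keys(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    result: dict[str, str] = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result
```

For example, is_private(Path.home() / ".env") returning False suggests tightening the file with chmod 600 ~/.env.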
Running in a local python environment
pip install hep-data-llm
hep-data-llm plot "Plot the ETmiss of all events in the rucio dataset mc23_13p6TeV:mc23_13p6TeV.801167.Py8EG_A14NNPDF23LO_jj_JZ2.deriv.DAOD_PHYSLITE.e8514_e8528_a911_s4114_r15224_r15225_p6697." output.md
The output will be in output.md - view it in a markdown rendering program (I use vscode). An img directory will be created and it will contain the plot (hopefully).
Use hep-data-llm plot --help to see all the options you can give it. It defaults to using gpt-5, the most successful model in tests.
Default questions
A questions.yaml file is bundled with the package containing a list of common plotting questions. To run one of these questions by number, pass the index (starting from 1) instead of the full text:
hep-data-llm plot 1 output.md
This will execute the first question from questions.yaml.
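The number-or-text behavior can be pictured with a small sketch. This is an illustration only: resolve_question is a hypothetical helper, and the way the CLI actually loads questions.yaml and resolves the argument is not shown here.

```python
def resolve_question(arg: str, questions: list[str]) -> str:
    """Return the bundled question for a 1-based index, else the text itself."""
    text = arg.strip()
    if text.isdecimal():
        index = int(text)
        if not 1 <= index <= len(questions):
            raise ValueError(
                f"question number must be between 1 and {len(questions)}"
            )
        # Indices start from 1, so shift by one for the Python list.
        return questions[index - 1]
    return arg
```

With this sketch, resolve_question("1", questions) returns the first bundled question, while any non-numeric argument is passed through unchanged as the question text.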
Question references and metrics
Some questions include reference metrics for each expected plot to help validate the generated output. References live alongside the question text in questions.yaml and consist of per-plot average entries per event and mean values derived from the raw data list used to fill the histogram. For example:
questions:
  - text: "Plot the ETmiss of all events in the rucio dataset user.zmarshal:user.zmarshal.364702_OpenData_v1_p6026_2024-04-23."
    references:
      plots:
        - avg_entries_per_event: 1.0
          mean: 38.5
When a question includes references, the generated plotting code is expected to print lines like METRIC: avg_entries_per_event=<N> mean=<M> for each plot, computed directly from the numbers passed into the histogram. The CLI will compare these against the reference pairs to determine success.
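As a rough sketch of both sides of this check, the code below formats a METRIC line from the raw values that fill one histogram, then parses such lines back out of captured output and compares them to reference pairs. The function names and the 1% relative tolerance are assumptions made for illustration; the comparison rule the CLI actually applies is not described here.

```python
import math
import re

METRIC_RE = re.compile(
    r"^METRIC:\s*avg_entries_per_event=(?P<avg>\S+)\s+mean=(?P<mean>\S+)\s*$"
)


def metric_line(values: list[float], n_events: int) -> str:
    """Format a METRIC line from the (non-empty) raw values of one histogram."""
    avg_entries = len(values) / n_events
    mean = sum(values) / len(values)
    return f"METRIC: avg_entries_per_event={avg_entries} mean={mean}"


def parse_metrics(output: str) -> list[tuple[float, float]]:
    """Extract (avg_entries_per_event, mean) pairs from captured stdout."""
    pairs = []
    for line in output.splitlines():
        match = METRIC_RE.match(line.strip())
        if match:
            pairs.append((float(match["avg"]), float(match["mean"])))
    return pairs


def matches_reference(found, reference, rel_tol=0.01):
    """Compare pairs in order; rel_tol is an assumed tolerance, not the CLI's."""
    if len(found) != len(reference):
        return False
    return all(
        math.isclose(f_avg, r_avg, rel_tol=rel_tol)
        and math.isclose(f_mean, r_mean, rel_tol=rel_tol)
        for (f_avg, f_mean), (r_avg, r_mean) in zip(found, reference)
    )
```

For instance, metric_line([30.0, 40.0, 45.5], 3) describes three events with one entry each, which gives avg_entries_per_event=1.0 and mean=38.5, the same pair as the reference in the example above.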
Running with uvx
This is great if you want to just run once or twice.
uvx hep-data-llm plot "Plot the ETmiss of all events in the rucio dataset mc23_13p6TeV:mc23_13p6TeV.801167.Py8EG_A14NNPDF23LO_jj_JZ2.deriv.DAOD_PHYSLITE.e8514_e8528_a911_s4114_r15224_r15225_p6697." output.md
This uses the uvx tool to install a temporary environment. If you want to keep it installed, you can use uv tool install hep-data-llm. Do remember to update it every now and then!
Usage
new profile
Use the new profile <filename> command to create a new profile. It copies the default profile, which you can then modify with new prompts or other items.
Creating a new workflow
Creating a new workflow, otherwise known as creating a new prompt, means creating a new prompt file and hint files. This section describes what that takes in the context of this package.
- General preparation - you'll need a docker container with the appropriate software installed. You'll also need a good set of test instructions.
- Use the hep-data-llm new profile my-prompt.yaml command to create a "dummy" prompt file.
- Edit the new profile yaml file:
  - If you are editing new hint files, then replace the list of hint files with a local (relative) reference to the hint files you want to use.
  - Choose a fairly cheap model to run (since you'll probably be running it a lot). Change the model: entry (or you can use the --model option).
- When you are ready to test, use hep-data-llm plot --profile my-prompt --ignore-cache hints <question> output.md. Replace <question> with your question or a question number from the default list of questions.
  - Note the ignore-cache - the code always caches the hints files, even if they are located on the local disk.
  - Use --repeat N to record multiple, independent runs for each model. Every trial bypasses the LLM cache automatically so you get fresh outputs for each repetition without reusing earlier responses.
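When iterating on a profile it can help to script the test command from the steps above. The sketch below only assembles the command line using the options mentioned in this section (--profile, --ignore-cache hints, --repeat, --model); build_plot_command and run_questions are hypothetical helpers, and the per-question output file naming is an assumption.

```python
import subprocess


def build_plot_command(question, output, profile, repeat=None, model=None):
    """Assemble the hep-data-llm plot command as an argument list."""
    cmd = ["hep-data-llm", "plot", "--profile", profile, "--ignore-cache", "hints"]
    if repeat is not None:
        cmd += ["--repeat", str(repeat)]
    if model is not None:
        cmd += ["--model", model]
    # The question (text or number) and the output file come last.
    cmd += [str(question), output]
    return cmd


def run_questions(question_numbers, profile, **options):
    """Run each bundled question in turn, writing one markdown file per question."""
    for number in question_numbers:
        output = f"output-{number}.md"  # assumed naming scheme
        subprocess.run(
            build_plot_command(number, output, profile, **options), check=True
        )
```

A call such as run_questions([1, 2], "my-prompt", repeat=3) would then run the first two default questions three times each against the new profile.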
Notes from adding a servicex-RDF workflow:
- The guardrail that looked for the png file to be written out had to be altered.
- Hint files that described servicex assumed awkward output - the hint file had to be split in two so that there was a short hint file that described how to generate a servicex request and a second one that described how to take the results and turn them into awkward data. The same thing then had to occur for rdf.
- A new docker container had to be built, in this case based on the ROOT container image.
License
hep-data-llm is distributed under the terms of the MIT license.