myLLannotator
Alyssa Lee and Rohan Maddamsetti
- Requirements
- Installation
- Downloading the llama3.2 model
- How to run
- Usage
- Optional arguments and using other models
- Important caveats and troubleshooting
- Replicating results in the paper
Overview
User-friendly tool for automated annotation of metadata with open-source LLM
Requirements
- python>=3.10
- ollama>=0.6.1
- tqdm>=4.41.0
To reproduce paper figures:
- R >= 4.2 for generating figures and re-running the analyses in the paper
Installation
There are two ways to use this code. The first is to install the prebuilt package, which should automatically install the dependencies. The second is to download the script src/myllannotator/main.py and run it directly, in which case you must install the dependencies yourself and can skip the rest of this section.
The package can be installed from PyPI or from source (tarball).
How to install as a package from PyPI:
This is the test version on TestPyPI.
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple myllannotator
(Once it is published to PyPI, the command pip install myllannotator should work)
How to install as a package from source:
First download the source tarball from the latest release. Then run:
pip install myllannotator-*.tar.gz
Downloading the llama3.2 model
After installing the ollama package, download the llama3.2 model (this step is required):
ollama pull llama3.2:latest
How to run
The following sample commands use the example data under input/ and write the output to a new file annotated_data.csv.
How to run with package installed:
myllannotator input/valid_categories.txt input/system_prompt.txt input/per_sample_prompt.txt input/input_data.csv annotated_data.csv
How to run without package installed (assuming all dependencies are installed):
python main.py input/valid_categories.txt input/system_prompt.txt input/per_sample_prompt.txt input/input_data.csv annotated_data.csv
Usage
Brief overview of command line usage:
usage: myllannotator [-h]
valid_categories system_prompt per_sample_prompt
input_csv output_csv
positional arguments:
valid_categories .txt file of valid categories, separated by line breaks.
system_prompt .txt file containing system prompt
per_sample_prompt .txt file containing per-sample prompt
input_csv .csv file of input data
output_csv .csv file for output data
options:
-h, --help show this help message and exit
Also see input/ for examples of each input format.
valid_categories (.txt)
List of categories separated by line breaks. Make sure to include an NA category if you want the model to have the option to assign no annotation.
Example:
Human
Animal
NA
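Both prompts can reference this list through a {categories} placeholder, which gets expanded into a comma-separated, quoted list. A minimal sketch of that expansion (this is an illustration, not the tool's actual code):

```python
# Sketch: turn the lines of a valid_categories file into the
# comma-separated, quoted list substituted for {categories}.
def format_categories(text: str) -> str:
    cats = [line.strip() for line in text.splitlines() if line.strip()]
    return ", ".join(f'"{c}"' for c in cats)

print(format_categories("Human\nAnimal\nNA"))
# → "Human", "Animal", "NA"
```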
system_prompt (.txt)
The system prompt guides the LLM's overall behavior. Here you should give specific instructions for the annotation task.
Optionally, you can include {categories} somewhere in the text, which will be replaced by a comma-separated list of the values in valid_categories (for example, "Human", "Animal", "NA"). The tool will print the properly formatted version upon running so you can check if it is what you expected.
Example:
You are an annotation tool for labeling the environment category that a microbial sample came from, given the host and isolation source metadata reported for this genome. Label the sample as one of the following categories: {categories} according to the following criteria. Samples from a human body should be labeled 'Humans'. Samples from domesticated or farm animals [...] Give a strictly one-word response that exactly matches one of these categories, omitting punctuation marks.
per_sample_prompt (.txt)
The per-sample prompt tells the LLM the relevant metadata for each sample.
The way you write this prompt will depend on the columns in your input data. Where you write {0} in the prompt, it will be replaced by the value in column 0 of the input data, {1} will be replaced by the value in column 1, etc. See the example below. The tool will print the properly formatted version upon running so you can check if it is what you expected.
Optionally, you can also include {categories} here; as in the system prompt, it will be replaced by a comma-separated list of the values in valid_categories (for example, "Human", "Animal", "NA").
Example (for an input dataset with three columns Annotation_Accession,host,isolation_source):
Consider a microbial sample from the host "{1}" and the isolation source "{2}". Label the sample as one of the following categories: {categories}. Give a strictly one-word response without punctuation marks.
For the first sample, the prompt received by the LLM will be:
Consider a microbial sample from the host "chicken" and the isolation source "Epidemic materials". Label the sample as one of the following categories: "Humans", "Livestock", "Food", "Freshwater", "Anthropogenic", "Marine", "Sediment", "Agriculture", "Soil", "Terrestrial", "Plants", "Animals", "NA". Give a strictly one-word response without punctuation marks.
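The placeholder substitution behaves like Python's str.format: positional fields pull from the CSV columns in order, and the named {categories} field pulls from the valid_categories file. A sketch of that behavior (an illustration, not the tool's internals; the categories string here is shortened):

```python
# {1} and {2} come from columns 1 and 2 of the CSV row;
# {categories} comes from the valid_categories file.
template = ('Consider a microbial sample from the host "{1}" and the '
            'isolation source "{2}". Label the sample as one of the '
            'following categories: {categories}.')
row = ["GCF_019552145.1_ASM1955214v1", "chicken", "Epidemic materials"]
prompt = template.format(*row, categories='"Humans", "Livestock", "NA"')
print(prompt)
```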
input_csv (.csv)
This is your input data. It can have any number of columns. You will need to write your per-sample prompt according to the column order (see above).
Annotation_Accession,host,isolation_source
GCF_019552145.1_ASM1955214v1,chicken,Epidemic materials
GCF_001635975.1_ASM163597v1,Homo sapiens,NA
GCF_900636445.1_41965_G01,NA,Oral Cavity
output_csv (.csv)
Path to the output file, which will be created as the program runs.
The format of the output file will be the same as the input file, with one additional column for the annotation.
Example:
Annotation_Accession,host,isolation_source,Annotation
GCF_019552145.1_ASM1955214v1,chicken,Epidemic materials,Livestock
GCF_001635975.1_ASM163597v1,Homo sapiens,NA,Humans
GCF_900636445.1_41965_G01,NA,Oral Cavity,Humans
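The output is simply the input CSV with one extra column appended per row. A self-contained sketch of that transformation (the annotations here are hard-coded stand-ins for LLM responses):

```python
import csv
import io

# Sketch: keep every input column and append one "Annotation" column.
input_csv = ("Annotation_Accession,host,isolation_source\n"
             "GCF_001635975.1_ASM163597v1,Homo sapiens,NA\n")
annotations = ["Humans"]  # one label per data row, as the LLM would return

reader = csv.reader(io.StringIO(input_csv))
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(next(reader) + ["Annotation"])  # extend the header
for row, label in zip(reader, annotations):
    writer.writerow(row + [label])
print(out.getvalue())
```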
Optional arguments and using other models
Optional arguments:
options:
-h, --help show this help message and exit
--model-name MODEL_NAME
ollama model name, default is llama3.2:latest
--max-tries MAX_TRIES
maximum number of attempts per sample if the LLM
response is invalid, default is 5
--silent if enabled, do not print usual prompt output
--debug if enabled, print debug output, and only annotate the
first 5 samples
--disable-system-role
Disables the system role, instead having the system
prompt come from the user. Set this option when using
LLMs that do not have a system role.
We have only extensively tested the code with llama3.2:latest. Many other models of varying size, speed, and accuracy are available at https://ollama.com/search. Cloud models may not work due to limits on the number of queries.
To use another model, first download the model, and then add the model name to your command. For example:
ollama pull gemma3:latest
And then modify your command to include --model-name gemma3:
myllannotator input/valid_categories.txt input/system_prompt.txt input/per_sample_prompt.txt input/input_data.csv annotated_data.csv --model-name gemma3 --debug --max-tries 3
To view all the models you have downloaded:
ollama list
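To illustrate what --disable-system-role implies, here is a sketch of the two ways the prompts could be arranged into an ollama-style chat message list. The build_messages function is hypothetical and the tool's internals may differ:

```python
# Sketch: with the system role enabled, the system prompt is sent as its
# own {"role": "system"} message; with --disable-system-role, it is
# folded into the user message instead.
def build_messages(system_prompt, user_prompt, disable_system_role=False):
    if disable_system_role:
        return [{"role": "user",
                 "content": system_prompt + "\n\n" + user_prompt}]
    return [{"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}]
```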
Important caveats and troubleshooting
- The tool is not deterministic. Different answers may be produced on the same input data.
- The tool will give up on labeling a particular sample after the number of failed attempts exceeds the maximum limit. In that case, NoAnnotation will show up as the annotation.
- If you get an error like ollama._types.ResponseError: model 'llama3.2:latest' not found (status code: 404), that means you have not downloaded the model from ollama. Run ollama pull llama3.2:latest to fix the error.
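The retry-then-give-up behavior described above can be sketched as follows. The query_llm stub stands in for a real ollama call (it is hypothetical, as are the category names; this is not the tool's actual code):

```python
import random

VALID = {"Humans", "Livestock", "NA"}

def query_llm(prompt: str) -> str:
    # Stand-in for an LLM call; may return an out-of-vocabulary answer.
    return random.choice(["Humans", "Livestock", "chicken farm"])

def annotate(prompt: str, max_tries: int = 5) -> str:
    # Retry until the response matches a valid category, up to max_tries.
    for _ in range(max_tries):
        answer = query_llm(prompt).strip()
        if answer in VALID:
            return answer
    return "NoAnnotation"  # give up after max_tries invalid responses

print(annotate("Label this sample."))
```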
Replicating results in the paper
- Download data: Go to https://rutgers.box.com/v/myLLannotator-data and click the button to download the data.
- Unzip myLLannotator-data.zip and go into the project directory: cd myLLannotator-data
- Download scripts from this github repository: save paper/annotator.py and paper/simple-ARG-duplication-analysis.R into a new subdirectory src/. Your file structure should look like this:
myLLannotator-data
├── data
│ └── Maddamsetti2024
│ ├── all-proteins.csv
│ ├── computationally-annotated-gbk-annotation-table.csv
│ ├── duplicate-proteins.csv
│ ├── FileS3-Complete-Genomes-with-Duplicated-ARG-annotation.csv
│ └── gbk-annotation-table.csv
├── results
└── src
├── annotator.py
└── simple-ARG-duplication-analysis.R
- Make sure required dependencies are installed.
- Run the python script src/annotator.py to annotate the data. Output files will be saved to the results folder.
- Run the R script src/simple-ARG-duplication-analysis.R to run the analysis and generate figures.