
myLLannotator

Overview

User-friendly tool for automated annotation of metadata with open-source LLM

by Alyssa Lu Lee and Rohan Maddamsetti

GitHub

Quickstart

Step-by-step installation with conda

conda create -n myllannotator-env
conda activate myllannotator-env
conda install pip
pip install myllannotator
ollama pull llama3.2:latest
myllannotator --help

See How to run and Usage. Example input files are located in input/.

Documentation

  1. Requirements
  2. Installation
  3. Downloading the llama3.2 model
  4. How to run
  5. Usage
  6. Optional arguments and using other models
  7. Important caveats and troubleshooting
  8. Replicating results in the paper

[Image: a muscular cyborg rainbow llama with a face stripe like Ziggy Stardust and a long rainbow mane, working hard on a laptop]

Requirements

  • python>=3.10
  • ollama>=0.6.1
  • tqdm>=4.41.0

To reproduce paper figures:

  • R >= 4.2 for re-running the analyses and generating the figures

Installation

There are two options for using this code. The first is to install the prebuilt package, which automatically installs its dependencies. The second is to download the script src/myllannotator/main.py and run it directly (in which case you must install the dependencies yourself).

The package can be installed from PyPI or from source (tarball).

How to install as a package from PyPI:

pip install myllannotator

How to install as a package from source:

First download the source tarball from the latest release. Then run:

pip install myllannotator-*.tar.gz

Downloading the llama3.2 model

After installing the ollama package, download the llama3.2 model (this step is required):

ollama pull llama3.2:latest

How to run

The following sample commands use the example data under input/ and write the output to a new file annotated_data.csv.

How to run with package installed:

myllannotator input/valid_categories.txt input/system_prompt.txt input/per_sample_prompt.txt input/input_data.csv annotated_data.csv

How to run without package installed (assuming all dependencies are installed):

python main.py input/valid_categories.txt input/system_prompt.txt input/per_sample_prompt.txt input/input_data.csv annotated_data.csv

Usage

Brief overview of command line usage:

usage: myllannotator [-h]
                     valid_categories system_prompt per_sample_prompt
                     input_csv output_csv

positional arguments:
  valid_categories   .txt file of valid categories, separated by line breaks.
  system_prompt      .txt file containing system prompt
  per_sample_prompt  .txt file containing per-sample prompt
  input_csv          .csv file of input data
  output_csv         .csv file for output data

options:
  -h, --help         show this help message and exit

Also see input/ for examples of each input format.
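For illustration, the positional interface above could be declared with argparse roughly like this (a sketch mirroring the usage text, not the package's actual source):

```python
import argparse

# Sketch: a minimal argparse declaration matching the documented interface.
parser = argparse.ArgumentParser(prog="myllannotator")
for name, help_text in [
    ("valid_categories", ".txt file of valid categories, separated by line breaks"),
    ("system_prompt", ".txt file containing system prompt"),
    ("per_sample_prompt", ".txt file containing per-sample prompt"),
    ("input_csv", ".csv file of input data"),
    ("output_csv", ".csv file for output data"),
]:
    parser.add_argument(name, help=help_text)

# Parsing a sample command line:
args = parser.parse_args(
    ["cats.txt", "sys.txt", "per.txt", "in.csv", "out.csv"]
)
```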

valid_categories (.txt)

List of categories separated by line breaks. Make sure to include an NA category if you want the model to have the option to assign no annotation.

Example:

Human
Animal
NA

system_prompt (.txt)

The system prompt guides the LLM's overall behavior. Here you should give specific instructions for the annotation task.

Optionally, you can include {categories} somewhere in the text, which will be replaced by a comma-separated list of the values in valid_categories (for example, "Human", "Animal", "NA"). The tool will print the properly formatted version upon running so you can check if it is what you expected.

Example:

You are an annotation tool for labeling the environment category that a microbial sample came from, given the host and isolation source metadata reported for this genome. Label the sample as one of the following categories: {categories} according to the following criteria. Samples from a human body should be labeled 'Humans'. Samples from domesticated or farm animals [...] Give a strictly one-word response that exactly matches one of these categories, omitting punctuation marks.
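The {categories} substitution can be pictured as a simple str.format call. The following is a sketch, assuming the categories are quoted and comma-joined as in the formatted output the tool prints:

```python
# Contents of a hypothetical valid_categories.txt:
valid_categories_text = "Human\nAnimal\nNA\n"

# One category per line; blank lines ignored.
categories = [line.strip() for line in valid_categories_text.splitlines() if line.strip()]

# {categories} in the system prompt is replaced by the quoted, comma-separated list.
system_prompt = "Label the sample as one of the following categories: {categories}."
formatted = system_prompt.format(categories=", ".join(f'"{c}"' for c in categories))
# formatted -> 'Label the sample as one of the following categories: "Human", "Animal", "NA".'
```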

per_sample_prompt (.txt)

The per-sample prompt tells the LLM the relevant metadata for each sample.

The way you write this prompt will depend on the columns in your input data. Where you write {0} in the prompt, it will be replaced by the value in column 0 of the input data, {1} will be replaced by the value in column 1, etc. See the example below. The tool will print the properly formatted version upon running so you can check if it is what you expected.

Optionally, you can also include {categories} here; as in the system prompt, it will be replaced by a comma-separated list of the values in valid_categories.

Example (for an input dataset with three columns Annotation_Accession,host,isolation_source):

Consider a microbial sample from the host "{1}" and the isolation source "{2}". Label the sample as one of the following categories: {categories}. Give a strictly one-word response without punctuation marks.

For the first sample, the prompt received by the LLM will be:

Consider a microbial sample from the host "chicken" and the isolation source "Epidemic materials". Label the sample as one of the following categories: "Humans", "Livestock", "Food", "Freshwater", "Anthropogenic", "Marine", "Sediment", "Agriculture", "Soil", "Terrestrial", "Plants", "Animals", "NA". Give a strictly one-word response without punctuation marks.
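The placeholder substitution behaves like Python's str.format with the CSV row supplied positionally; a sketch using the example row from this README (the shortened category list is for illustration only):

```python
# One data row from input_csv (columns 0, 1, 2):
row = ["GCF_019552145.1_ASM1955214v1", "chicken", "Epidemic materials"]
categories = ["Humans", "Livestock", "NA"]  # shortened list for illustration

template = (
    'Consider a microbial sample from the host "{1}" and the isolation '
    'source "{2}". Label the sample as one of the following categories: '
    "{categories}."
)
# {0}, {1}, {2} are filled from the row; {categories} from valid_categories.
prompt = template.format(*row, categories=", ".join(f'"{c}"' for c in categories))
```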

input_csv (.csv)

This is your input data. It can have any number of columns. You will need to write your per-sample prompt according to the column order (see above).

Annotation_Accession,host,isolation_source
GCF_019552145.1_ASM1955214v1,chicken,Epidemic materials
GCF_001635975.1_ASM163597v1,Homo sapiens,NA
GCF_900636445.1_41965_G01,NA,Oral Cavity

output_csv (.csv)

Path to the output file, which will be created as the program runs.

The format of the output file will be the same as the input file, with one additional column for the annotation.

Example:

Annotation_Accession,host,isolation_source,Annotation
GCF_019552145.1_ASM1955214v1,chicken,Epidemic materials,Livestock
GCF_001635975.1_ASM163597v1,Homo sapiens,NA,Humans
GCF_900636445.1_41965_G01,NA,Oral Cavity,Humans
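The input-to-output relationship can be sketched with the csv module, assuming (as the example shows) that an Annotation column is appended to each input row:

```python
import csv
import io

# In-memory stand-ins for the input and output files:
input_csv = (
    "Annotation_Accession,host,isolation_source\n"
    "GCF_001635975.1_ASM163597v1,Homo sapiens,NA\n"
)
reader = csv.reader(io.StringIO(input_csv))
out = io.StringIO()
writer = csv.writer(out)

# The header gains one extra column for the annotation.
writer.writerow(next(reader) + ["Annotation"])
for row in reader:
    annotation = "Humans"  # placeholder for the LLM's label
    writer.writerow(row + [annotation])
```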

Optional arguments and using other models

Optional arguments:

options:
  -h, --help            show this help message and exit
  --model-name MODEL_NAME
                        ollama model name, default is llama3.2:latest
  --max-tries MAX_TRIES
                        maximum number of attempts per sample if the LLM
                        response is invalid, default is 5
  --silent              if enabled, do not print usual prompt output
  --debug               if enabled, print debug output, and only annotate the
                        first 5 samples
  --disable-system-role
                        Disables the system role, instead having the system
                        prompt come from the user. Set this option when using
                        LLMs that do not have a system role.

We have only extensively tested the code with llama3.2:latest. Many other models of varying size, speed, and accuracy are available at https://ollama.com/search. Cloud models may not work due to limits on the number of queries.

To use another model, first download the model, and then add the model name to your command. For example:

ollama pull gemma3:latest

Then modify your command to include --model-name gemma3 (shown here together with --debug and --max-tries 3):

myllannotator input/valid_categories.txt input/system_prompt.txt input/per_sample_prompt.txt input/input_data.csv annotated_data.csv --model-name gemma3 --debug --max-tries 3

To view all the models you have downloaded:

ollama list

Important caveats and troubleshooting

  • The tool is not deterministic: it may produce different answers on the same input data.
  • The tool gives up on a sample once the number of failed attempts reaches the maximum limit (--max-tries); in that case, NoAnnotation is written as the annotation.
  • If you get an error like ollama._types.ResponseError: model 'llama3.2:latest' not found (status code: 404), you have not downloaded the model from ollama. Run ollama pull llama3.2:latest to fix the error.
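The retry-then-give-up behavior described above can be sketched as follows (annotate_sample and query_fn are illustrative names, not the package's API; a stub stands in for the real ollama call):

```python
def annotate_sample(query_fn, prompt, valid_categories, max_tries=5):
    """Ask the model up to max_tries times; fall back to NoAnnotation."""
    for _ in range(max_tries):
        response = query_fn(prompt).strip()
        if response in valid_categories:
            return response
    return "NoAnnotation"

# Stubbed model: the first reply is invalid, the second is a valid category.
replies = iter(["Human.", "Human"])
label = annotate_sample(lambda p: next(replies), "prompt", {"Human", "Animal", "NA"})
# label -> "Human"

# A model that never answers validly yields NoAnnotation.
fallback = annotate_sample(lambda p: "???", "prompt", {"Human", "Animal", "NA"})
# fallback -> "NoAnnotation"
```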

Replicating results in the paper

  1. Download data: Go to https://rutgers.box.com/v/myLLannotator-data and click the button to download the data.
  2. Unzip myLLannotator-data.zip and go into the project directory: cd myLLannotator-data
  3. Download scripts from the GitHub repository: Save paper/annotator.py and paper/simple-ARG-duplication-analysis.R into a new subdirectory src/. Your file structure should look like this:
myLLannotator-data
├── data
│   └── Maddamsetti2024
│       ├── all-proteins.csv
│       ├── computationally-annotated-gbk-annotation-table.csv
│       ├── duplicate-proteins.csv
│       ├── FileS3-Complete-Genomes-with-Duplicated-ARG-annotation.csv
│       └── gbk-annotation-table.csv
├── results
└── src
    ├── annotator.py
    └── simple-ARG-duplication-analysis.R
  4. Make sure the required dependencies are installed.
  5. Run the python script src/annotator.py to annotate the data. Output files will be saved to the results folder.
  6. Run the R script src/simple-ARG-duplication-analysis.R to run the analysis and generate figures.
