A CLI tool for collapsing and filtering structural genome annotations using gene family assignments

Adjudicator

Version: 1.0.0 · License: Apache-2.0 · Status: Production/Stable
PyPI: pip install adjudicator · Source: ncgr/Adjudicator
Author: Connor Cameron · ctc@ncgr.org · National Center for Genome Resources


Overview

Adjudicator is a command-line tool for collapsing and filtering structural genome annotations from multiple sources. It uses best-hit HMM domain scores from gene family assignments (via the Legume Information System) to compare overlapping gene models and select the best-supported model in each set of overlapping models.

Two commands are provided:

  • collapse — Merge overlapping gene models from two or more annotators, selecting the best model per region.
  • repeat-filter — Remove gene models that overlap known repeat or transposon regions beyond a configurable coverage threshold.

Requirements

| Requirement | Version |
|---|---|
| Python | ≥ 3.10 |
| click | ≥ 8.1 |
| intervaltree | ≥ 3.2.1 |
| sortedcontainers | ≥ 2.4.0 |

Installation

pip install adjudicator
adjudicator --version
# adjudicator, version 1.0.0

Input File Formats

TSV Sample Sheet (--input-tsv)

Tab-separated. Lines beginning with # and blank lines are skipped.

| Column | Type | Description |
|---|---|---|
| 1 | string | Unique label for this evidence set. |
| 2 | path | Path to the .gff3 structural annotation file. |
| 3 | path | Path to the .gfa LIS gene family assignment file. |

Row order determines precedence when gene models have equivalent scores.

# label	gff3_path	gfa_path
maker	/data/ann/maker.gff3	/data/fam/maker.gfa
helixer	/data/ann/helixer.gff3	/data/fam/helixer.gfa
stringtie	/data/ann/stringtie.gff3	/data/fam/stringtie.gfa
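
The parsing rules above can be sketched in a few lines of Python. This is an illustration only, not Adjudicator's own parser: the function name `parse_sample_sheet` is hypothetical, and it raises a plain `ValueError` where the tool reports a click `BadParameter`. The messages follow the Error Reference section.

```python
import io


def parse_sample_sheet(handle):
    """Return (label, gff3_path, gfa_path) tuples in file order (hypothetical helper)."""
    rows = []
    for line_no, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        # Blank lines and lines beginning with '#' are skipped.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"Line {line_no}: expected 3 columns, got {len(fields)}."
            )
        label, gff3_path, gfa_path = (field.strip() for field in fields)
        if not label:
            raise ValueError(f"Line {line_no}: column 1 (label) must not be empty.")
        if not gff3_path.endswith(".gff3"):
            raise ValueError(f"Line {line_no}: column 2 must end in '.gff3'")
        if not gfa_path.endswith(".gfa"):
            raise ValueError(f"Line {line_no}: column 3 must end in '.gfa'")
        rows.append((label, gff3_path, gfa_path))
    if not rows:
        raise ValueError("No data rows found.")
    # List order is row order, which decides precedence on equal scores.
    return rows


SHEET = (
    "# label\tgff3_path\tgfa_path\n"
    "\n"
    "maker\t/data/ann/maker.gff3\t/data/fam/maker.gfa\n"
    "helixer\t/data/ann/helixer.gff3\t/data/fam/helixer.gfa\n"
)
for label, gff3_path, gfa_path in parse_sample_sheet(io.StringIO(SHEET)):
    print(label, gff3_path, gfa_path)
```

Returning a list rather than a dict keeps the row order explicit, since that order is meaningful to the collapse command.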

GFF3 (.gff3)

Standard GFF3 format with a three-level hierarchy: gene → mRNA → exon. See the GFF3 specification.
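
To make the three-level hierarchy concrete, the following sketch groups GFF3 features by their `Parent` attribute. The records, IDs, and function names are made up for illustration, and the code assumes simple `key=value` attributes without URL-escaped characters; it is not how Adjudicator itself reads GFF3.

```python
from collections import defaultdict

# Made-up records: one gene with one mRNA and two exons.
RECORDS = [
    ("chr1", "maker", "gene", 100, 500, ".", "+", ".", "ID=gene1"),
    ("chr1", "maker", "mRNA", 100, 500, ".", "+", ".", "ID=mRNA1;Parent=gene1"),
    ("chr1", "maker", "exon", 100, 200, ".", "+", ".", "ID=exon1;Parent=mRNA1"),
    ("chr1", "maker", "exon", 300, 500, ".", "+", ".", "ID=exon2;Parent=mRNA1"),
]
LINES = ["\t".join(str(value) for value in record) for record in RECORDS]


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict."""
    return dict(pair.split("=", 1) for pair in column9.strip().split(";") if pair)


def build_hierarchy(lines):
    """Map each parent ID to a list of (type, id, start, end) child features."""
    children = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        feature = (cols[2], attrs.get("ID"), int(cols[3]), int(cols[4]))
        # A feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                children[parent].append(feature)
    return children


children = build_hierarchy(LINES)
for mrna in children["gene1"]:
    print(mrna, "->", children[mrna[1]])
```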

GFA (.gfa)

LIS gene family assignment files produced by the Legume Information System gene family pipeline.


Commands

collapse

Collapses overlapping structural annotations across all entries in the TSV. Processing is hierarchical: Row 1 vs. Row 2 produces an intermediate result, which is then compared against Row 3, and so on.
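
The row-by-row processing behaves like a left fold over the sample sheet. The sketch below shows that shape for a single locus with one candidate model per row. `GeneModel`, `adjudicate`, and the scores are hypothetical, and ranking an orphan (no score) below any scored model is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Optional


@dataclass(frozen=True)
class GeneModel:
    gene_id: str
    source: str              # label from column 1 of the sample sheet
    score: Optional[float]   # best-hit HMM domain score; None for an orphan


def adjudicate(current, challenger):
    """Keep the higher-scoring model; on a tie the earlier row (current) wins."""
    current_score = float("-inf") if current.score is None else current.score
    challenger_score = float("-inf") if challenger.score is None else challenger.score
    return challenger if challenger_score > current_score else current


# One candidate per row, listed in sample-sheet order.
candidates = [
    GeneModel("mk_0001", "maker", 50.0),
    GeneModel("hx_0001", "helixer", 50.0),
    GeneModel("st_0001", "stringtie", 72.5),
]

# Row 1 vs. Row 2 gives an intermediate winner, which then meets Row 3.
winner = reduce(adjudicate, candidates)
print(winner.gene_id)
```

Because the comparison is strict (`>`), the model from the earlier row survives a tie, which matches the precedence rule stated for the sample sheet.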

Synopsis

adjudicator collapse --input-tsv <FILE> [OPTIONS]

Options

| Option | Short | Type | Default | Valid range | Description |
|---|---|---|---|---|---|
| --input-tsv | -i | path | (required) | | Tab-separated sample sheet. |
| --min-overlap | -m | float | 0.00001 | 0.0 – 1.0 | Minimum fractional overlap of feature A by feature B to consider them overlapping. |
| --no-orphans | -n | flag | False | | Exclude genes with no gene family assignment from the output. |
| --output-dir | -o | path | . | | Directory to write output files. Created if it does not exist. |
| --strict / --no-strict | | flag | False | | Exit with error if any referenced input file does not exist on disk. |
| --verbose | -v | flag | False | | Print per-sample file paths and processing steps to stdout. |
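
As a rough illustration of what --min-overlap measures, the sketch below computes the fraction of feature A covered by feature B using 1-based inclusive GFF3 coordinates. The function names are hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
def fractional_overlap(a_start, a_end, b_start, b_end):
    """Fraction of feature A's length that is covered by feature B."""
    shared = min(a_end, b_end) - max(a_start, b_start) + 1
    if shared <= 0:
        return 0.0
    return shared / (a_end - a_start + 1)


def is_overlapping(a, b, min_overlap=0.00001):
    """True if B covers at least min_overlap of A (and at least one base)."""
    fraction = fractional_overlap(a[0], a[1], b[0], b[1])
    return fraction > 0.0 and fraction >= min_overlap


# A spans 100..199 (100 bp); B covers its last 50 bp.
print(fractional_overlap(100, 199, 150, 300))             # 0.5
print(is_overlapping((100, 199), (150, 300), 0.4))        # True
print(is_overlapping((100, 199), (190, 300), 0.4))        # False: only 10% covered
```

Note that the measure is asymmetric: the fraction of A covered by B generally differs from the fraction of B covered by A.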

Output Files

<output-dir>/
├── A_B.wao.gff3                       # Overlap intersections
├── A_B.unique_b.gff3                  # Gene models unique to annotator B
├── A_B.final.gff3                     # Adjudicated gene IDs
├── A_B.gfa                            # Merged gene family assignments
└── A_B.final.wsubfeatures.gff3        # ✅ Primary output

Examples

# Default settings
adjudicator collapse \
    --input-tsv samples.tsv \
    --output-dir results/collapse/

# Exclude orphan genes and require at least 40% overlap
adjudicator collapse \
    --input-tsv samples.tsv \
    --no-orphans \
    --min-overlap 0.4 \
    --output-dir results/collapse/ \
    --verbose

repeat-filter

Filters gene models from each entry in the TSV against a reference repeat annotation. A gene model is removed when the fractional overlap between its exons and a repeat region exceeds --max-coverage.

Synopsis

adjudicator repeat-filter --input-tsv <FILE> --annotation <FILE> [OPTIONS]

Options

| Option | Short | Type | Default | Valid range | Description |
|---|---|---|---|---|---|
| --input-tsv | -i | path | (required) | | Tab-separated sample sheet. |
| --annotation | -a | path | (required) | | GFF3 file of repeat regions to filter against. |
| --max-coverage | -m | float | 0.4 | 0.0 – 1.0 | Maximum fractional overlap between a gene's exons and a repeat region before the model is removed. |
| --output-dir | -o | path | . | | Directory to write output files. Created if it does not exist. |
| --strict / --no-strict | | flag | False | | Exit with error if any referenced input file does not exist on disk. |
| --verbose | -v | flag | False | | Print per-sample file paths and processing steps to stdout. |
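
One plausible reading of the --max-coverage test is sketched below: sum the exon bases that fall inside any repeat, divide by the total exon length, and drop the gene if the result exceeds the threshold. Whether Adjudicator aggregates over all repeats or evaluates each repeat region separately is not stated here, so the aggregation is an assumption, and all function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def exon_repeat_coverage(exons, repeats):
    """Fraction of exon bases that lie inside any repeat interval."""
    exon_blocks = merge_intervals(exons)
    repeat_blocks = merge_intervals(repeats)
    total = sum(end - start + 1 for start, end in exon_blocks)
    covered = 0
    # Both lists are disjoint after merging, so no base is counted twice.
    for exon_start, exon_end in exon_blocks:
        for rep_start, rep_end in repeat_blocks:
            shared = min(exon_end, rep_end) - max(exon_start, rep_start) + 1
            if shared > 0:
                covered += shared
    return covered / total if total else 0.0


def keep_gene(exons, repeats, max_coverage=0.4):
    """A gene is removed only when coverage exceeds the threshold."""
    return exon_repeat_coverage(exons, repeats) <= max_coverage


exons = [(100, 199), (300, 399)]                      # 200 exon bases in total
print(exon_repeat_coverage(exons, [(150, 249)]))      # 0.25
print(keep_gene(exons, [(150, 249)]))                 # True
print(keep_gene(exons, [(100, 199), (300, 320)]))     # False: 121 of 200 bases covered
```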

Output Files

<output-dir>/
├── <label>_repeat_filter.wao.gff3                  # Overlap intersections
└── <label>_repeat_filter.final.wsubfeatures.gff3   # ✅ Primary output

Examples

# Default coverage threshold (0.4)
adjudicator repeat-filter \
    --input-tsv samples.tsv \
    --annotation repeats.gff3 \
    --output-dir results/filtered/

# Stricter threshold; fail fast if any input file is missing
adjudicator repeat-filter \
    --input-tsv samples.tsv \
    --annotation transposons.gff3 \
    --max-coverage 0.3 \
    --output-dir results/filtered/ \
    --strict \
    --verbose

Workflow

# Step 1: Filter repeat regions
adjudicator repeat-filter \
    --input-tsv raw_samples.tsv \
    --annotation repeats.gff3 \
    --output-dir step1_filtered/

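# Step 2: Rewrite TSV to point to filtered outputs (GFA paths unchanged).
# Comment and blank lines are passed through untouched so they are still skipped.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/ || /^[[:space:]]*$/ { print; next }
     { $2 = "step1_filtered/" $1 "_repeat_filter.final.wsubfeatures.gff3"; print }' \
    raw_samples.tsv > filtered_samples.tsv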

# Step 3: Collapse filtered annotations
adjudicator collapse \
    --input-tsv filtered_samples.tsv \
    --output-dir step2_collapsed/
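
For pipelines driven from Python, Step 2 can be written without awk. This is a sketch under the same naming assumption as the shell version (outputs named `<label>_repeat_filter.final.wsubfeatures.gff3`); the function name `rewrite_sample_sheet` is hypothetical.

```python
def rewrite_sample_sheet(lines, filtered_dir):
    """Point column 2 of each data row at the repeat-filtered GFF3 file."""
    rewritten = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Leave comments and blank lines exactly as they were.
        if not line.strip() or line.startswith("#"):
            rewritten.append(line)
            continue
        label, _old_gff3, gfa_path = line.split("\t")
        new_gff3 = f"{filtered_dir}/{label}_repeat_filter.final.wsubfeatures.gff3"
        rewritten.append("\t".join([label, new_gff3, gfa_path]))
    return rewritten


raw_sheet = [
    "# label\tgff3_path\tgfa_path\n",
    "maker\t/data/ann/maker.gff3\t/data/fam/maker.gfa\n",
]
for row in rewrite_sample_sheet(raw_sheet, "step1_filtered"):
    print(row)
```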

Error Reference

| Condition | --strict off | --strict on |
|---|---|---|
| Input file not found | Warning to stderr | Error: The following files were not found: ... |
| TSV row has wrong column count | BadParameter: Line N: expected 3 columns, got N. | Same |
| Label (column 1) is empty | BadParameter: Line N: column 1 (label) must not be empty. | Same |
| GFF3 path does not end in .gff3 | BadParameter: Line N: column 2 must end in '.gff3' | Same |
| GFA path does not end in .gfa | BadParameter: Line N: column 3 must end in '.gfa' | Same |
| TSV contains no data rows | Error: No data rows found in '<file>'. | Same |
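
The difference between the two --strict modes for missing files can be pictured as follows. This is a simplified sketch, not the tool's code: `check_inputs` is a hypothetical name, the exception type is chosen for illustration, and the way missing paths are listed after the colon is an assumption.

```python
import sys
from pathlib import Path


def check_inputs(paths, strict=False):
    """Warn about missing files, or raise if strict is set. Returns the missing paths."""
    missing = [str(path) for path in paths if not Path(path).is_file()]
    if missing:
        message = "The following files were not found: " + ", ".join(missing)
        if strict:
            # --strict on: stop before any processing starts.
            raise FileNotFoundError(message)
        # --strict off: report the problem on stderr and carry on.
        print(f"Warning: {message}", file=sys.stderr)
    return missing


missing = check_inputs(["/no/such/dir/maker.gff3"])
print(len(missing))
```

Running with --strict is the safer choice in automated pipelines, where a warning on stderr is easy to miss.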

Glossary

| Term | Definition |
|---|---|
| Gene model | A predicted gene structure represented as a gene → mRNA → exon hierarchy in GFF3. |
| GFF3 | Generic Feature Format version 3. Tab-delimited format for genomic features and their hierarchical relationships. |
| GFA | Gene Family Assignment file from the LIS pipeline, containing HMM domain scores used to rank competing gene models. |
| Adjudication | Selection of one gene model from a set of overlapping candidates based on HMM score evidence. |
| Orphan gene | A gene model with no gene family assignment in the GFA file. |
| WAO intersection | A bedtools-style "write all overlaps" operation reporting fractional overlap between features across two GFF3 files. |
