Convert GFF3 gene model to DDBJ MSS format

GFF2MSS: GFF3 converter for DDBJ submission via MSS

The Mass Submission System (MSS) of DDBJ requires a unique annotation file format for data submission. This Python script converts a standard GFF3 gene model file into an MSS annotation file. It builds the MSS file from three inputs: a GFF3 file with the gene models, a TSV file with the annotations, and a FASTA file with the genomic sequence. We recommend pre-processing your GFF3 data with AGAT (agat_convert_sp_gxf2gxf.pl --gff YOUR_GFF.gff) and GFF3sort.pl before conversion. After generating the MSS file, fill in the "COMMON" entries (SUBMITTER, REFERENCE, etc.) before submitting to DDBJ. This software is an unofficial converter for MSS. We do not guarantee that DDBJ accepts the generated files.
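Since the MSS file is built from a GFF3 file and a FASTA file together, it can help to confirm beforehand that every sequence name used in the GFF3 also exists in the FASTA. The following is a minimal pre-flight sketch, not part of GFF2MSS; the function names are hypothetical, and it assumes that the FASTA ID is the first word after ">".

```python
def fasta_ids(lines):
    """Return the set of sequence IDs (first word after '>') in FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def gff_seqids(lines):
    """Return the set of column-1 seqids in GFF3 lines."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        ids.add(line.split("\t", 1)[0])
    return ids


def missing_seqids(fasta_lines, gff_lines):
    """Seqids used in the GFF3 but absent from the FASTA, sorted."""
    return sorted(gff_seqids(gff_lines) - fasta_ids(fasta_lines))


# Toy data (hypothetical): chr3 is annotated but has no sequence.
fasta = [">chr1 demo", "ACGT", ">chr2", "GGCC"]
gff = [
    "##gff-version 3",
    "chr1\t.\tgene\t1\t4\t.\t+\t.\tID=g1",
    "chr3\t.\tgene\t1\t2\t.\t+\t.\tID=g2",
]
print(missing_seqids(fasta, gff))
```

For real files, pass open file handles instead of the lists; both functions only iterate over lines.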

v.4.2.1

Previous pip installation was not correctly configured. This version fixes the setup and packaging for proper installation via pip.

pip install gff2mss
gff2mss --help

v.4.2

Our package is now available for installation via pip. You can install it with the following command:

pip install gff2mss

v.4.1

Several new features have been added. Thanks to @kfuku52.

Modifications improve flexibility while ensuring compliance with DDBJ and annotation validation tools (e.g., transChecker).

Added country and collection_date fields to the source feature.

Example: country: Singapore: Tampines, collection_date: 2019-03-26

Added isolate field to the source feature.

Example: isolate: SING2019-196

Modified sex field.

Example: sex: male

Updated ff_definition format to:

@@[organism]@@ @@[isolate]@@ DNA, @@[submitter_seqid]@@

Introduced new options in GFF2MSS:

--iso for setting isolate

--sex for setting sex

--cou for setting country

--cod <collection_date> for setting collection_date

--mag for setting the minimum size of assembly_gap

--gel <known|unknown> for setting whether the assembly_gap size is known or unknown

User-Defined locus_tag Support

Allows users to specify their own locus_tag values instead of GFF2MSS’s numbering system.

Uses the third column of the TSV file provided via --ann.

If the TSV has only two columns (ID & Description), the behavior remains unchanged.

Adds > or < symbols to indicate incomplete start or stop codons.
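The two-column and three-column TSV layouts described above can be read with the same code. The sketch below is not taken from GFF2MSS; it assumes the TSV has a header row with the column names given in the --ann help text (ID, Description, and optionally Locus_tag), and the function name is hypothetical.

```python
import csv
import io


def read_annotation(handle):
    """Map each ID to (description, locus_tag or None).

    Assumes a tab-separated file with a header row containing 'ID' and
    'Description', plus an optional 'Locus_tag' column.
    """
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # A missing column or an empty cell both mean "no user-defined tag".
        tag = (row.get("Locus_tag") or "").strip() or None
        table[row["ID"]] = (row["Description"], tag)
    return table


two_col = io.StringIO("ID\tDescription\ng1\tprotein kinase\n")
three_col = io.StringIO("ID\tDescription\tLocus_tag\ng1\tprotein kinase\tABC_0001\n")
print(read_annotation(two_col))
print(read_annotation(three_col))
```

When the tag is None, a caller would fall back to the automatic numbering based on the -l prefix.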

v.4.0

Support for ambiguous "N" bases in FASTA: "N/n" bases are converted to "assembly_gap" features in the MSS file. Please install the "gffpandas" library.
To support "N" bases, the "artificial_location" qualifier was adopted to mark exons/CDSs modified by GFF2MSS. To fix frameshifts caused by "N"-base gaps, v.4.0 trims 1-2 bases from the exon as necessary. The artificially modified mRNA records are marked with this qualifier in the MSS file.
Processing speed was improved by using pandas (e.g., v3 = 14.8s, v4 = 4.2s).
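To illustrate the idea of turning "N/n" runs into gap features, here is a small sketch that locates such runs in a sequence string. It is an illustration only, not the code used by GFF2MSS; the function name and the min_size parameter (which mirrors the role of --mag) are assumptions.

```python
import re


def find_gaps(seq, min_size=0):
    """Return 1-based inclusive (start, end) pairs for runs of N/n.

    Runs shorter than min_size are skipped.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", seq):
        if match.end() - match.start() >= min_size:
            # match.start() is 0-based; match.end() is already the
            # 1-based inclusive end of the run.
            gaps.append((match.start() + 1, match.end()))
    return gaps


seq = "ACGTNNNNACGnnT"
print(find_gaps(seq))              # all runs
print(find_gaps(seq, min_size=3))  # only runs of 3 or more bases
```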

v.3.0

The --pid option is available to merge the previous protein IDs into the new submission.
"@@[entry]@@" and "@@[submitter_seqid]@@" were used in submitter_seqid.

v.2.0

rRNA and tRNA gene models were supported.
A utility for tRNAscan result processing was applied.
The license was changed from CC BY to the MIT License.

Requirement

Python 3.7. (Biopython, numpy, pandas, argparse, bcbio-gff, gffpandas)

Usage

usage: GFF2MSS.py [-h] -f FASTA -g GFF -a ANN -l LOC -n NAM [-s STN] -o OUT [-m MOL] [-p PID] [-t GTY] [-c GCT] [--ifc IFC] [--stc STC] [--iso ISO]
                  [--sex SEX] [--cou COU] [--cod COD] [--mag MAG] [--gel known|unknown] [--fwg asis|misc_feature] [--mis INT]
                  
optional arguments:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        File path to a genome sequence file
  -g GFF, --gff GFF     gff3 file for gene modeling
  -a ANN, --ann ANN     tsv file for gene annotation The 'ID' and 'Description' columns are mandatory. 'Locus_tag' is optional.
  -l LOC, --loc LOC     locus_tag prefix
  -n NAM, --nam NAM     organism name
  -s STN, --stn STN     strain
  -o OUT, --out OUT     output MSS file path (default = out.mss.txt)
  -m MOL, --mol MOL     mol_type value (default = genomic DNA)
  -p PID, --pid PID     file for protein ID (Only for the genome version-up)
  -t GTY, --gty GTY     type of linkage_evidence (default = paired-ends)
  -c GCT, --gct GCT     number of Genetic Code Tables (default = 1)
  --ifc IFC             default=no: inferring the completeness of gene models by the presence of start and stop codons and add '>' or '<' to the output.
  --stc STC             default=ATG: comma-separated list of start codons
  --iso ISO             default=: The 'isolate' value. See https://www.ddbj.nig.ac.jp/ddbj/file-format-e.html
  --sex SEX             default=: The 'sex' value. See https://www.ddbj.nig.ac.jp/ddbj/file-format-e.html
  --cou COU             default=: The 'country' value. See https://www.ddbj.nig.ac.jp/ddbj/file-format-e.html
  --cod COD             default=: The 'collection_date' value. See https://www.ddbj.nig.ac.jp/ddbj/file-format-e.html
  --mag MAG             default=0: Minimum size of 'gap_assembly'. Ns smaller than this size are not annotated as 'gap_assembly'.
  --gel known|unknown   default=known: Whether the estimate sizes of 'gap_assembly' are known.
  --fwg asis|misc_feature
                        default=asis: How to describe features spanning an 'assembly_gap'.
  --mis INT             default=0: Introns smaller than this size are annotated with 'artificial_location'.
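The --ifc and --stc options describe inferring completeness from start and stop codons and marking incomplete ends with '<' or '>'. The sketch below shows one way such a check could work; it is an assumption about the approach, not the GFF2MSS implementation. The function name is hypothetical, the default stop codons are those of genetic code table 1, and the CDS is assumed to be given on its coding strand.

```python
def location_with_partials(start, end, strand, cds_seq,
                           start_codons=("ATG",),
                           stop_codons=("TAA", "TAG", "TGA")):
    """Build a location string, adding '<' / '>' for incomplete ends.

    start/end are 1-based genomic coordinates (start <= end) and
    cds_seq is the coding-strand sequence read 5' to 3'.
    """
    seq = cds_seq.upper()
    has_start = seq[:3] in start_codons
    has_stop = seq[-3:] in stop_codons
    if strand == "+":
        left_partial, right_partial = not has_start, not has_stop
    else:
        # On the minus strand the stop codon lies at the low coordinate.
        left_partial, right_partial = not has_stop, not has_start
    loc = "{}{}..{}{}".format("<" if left_partial else "", start,
                              ">" if right_partial else "", end)
    return "complement({})".format(loc) if strand == "-" else loc


print(location_with_partials(1, 9, "+", "ATGAAATAA"))
print(location_with_partials(1, 9, "+", "AAAAAATAA"))
print(location_with_partials(1, 9, "-", "ATGAAAAAA"))
```

A comma-separated --stc value such as "ATG,GTG" would map to start_codons=("ATG", "GTG") in this sketch.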

Demo

Example 1: Plastid DNA

python3 GFF2MSS.py \
-f example/Lj3.0_Chloroplastl.fna \
-g example/Lj3.0_cp_gene_models.gff3  \
-a example/Lj3.0_anno.txt \
-l "PRE_TEST_" \
-n "Demo japonicus" \
-s "MG-20" \
-c "11" \
-o mss.ex1.out.txt

Example 2: Eukaryotic nuclear DNA with rDNA and tRNA data

python3 GFF2MSS.py \
-f example2/test.fa \
-g example2/test.gff  \
-a example2/annot.list \
-l "PRE_TEST_" \
-n "Demo japonicus" \
-s "DAOM100" \
-o mss.ex2.out.txt

Example 3: Eukaryotic nuclear DNA with "N"-base gapping

python3 GFF2MSS.py \
-f example2/test.fa \
-g example2/test.gff  \
-a example2/annot.list \
-l "PRE_TEST_" \
-n "Demo japonicus" \
-s "BB2" \
-t "paired-ends" \
-o mss.ex2.out.txt

rDNA data

To distinguish the type of rDNA sequence, please add a "Type=" attribute to each rRNA sub-feature.

18S ribosomal RNA; 18S
internal transcribed spacer 1; ITS1
5.8S ribosomal RNA; 5.8S
internal transcribed spacer 2; ITS2
28S ribosomal RNA; 28S

e.g.,
chr1	.	gene	2472097	2473907	.	+	.	ID=rRNA_001;Name=chr1_1
chr1	.	rRNA	2472097	2473907	.	+	.	Parent=rRNA_001;ID=rRNA_0011;Name=chr1_1-18S;Type=18S
chr1	.	exon	2472097	2473907	.	+	.	Parent=rRNA_0011;ID=rRNA_00111;Name=rRNA_00111
chr1	.	gene	2473908	2474013	.	+	.	ID=rRNA_002;Name=chr1_1-ITS1;Type=ITS1
chr1	.	rRNA	2473908	2474013	.	+	.	Parent=rRNA_002;ID=rRNA_0021;Name=chr1_1-ITS1;Type=ITS1
chr1	.	exon	2473908	2474013	.	+	.	Parent=rRNA_0021;ID=rRNA_00211;Name=rRNA_00211
chr1	.	gene	2474014	2474165	.	+	.	ID=rRNA_003;Name=chr1-1-5.8S;Type=5.8S
chr1	.	rRNA	2474014	2474165	.	+	.	Parent=rRNA_003;ID=rRNA_0031;Name=chr1-1-5.8S;Type=5.8S
chr1	.	exon	2474014	2474165	.	+	.	Parent=rRNA_0031;ID=rRNA_00311;Name=rRNA_00311
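Before conversion, it may be useful to check that every rRNA line carries a "Type=" attribute. The sketch below parses the ninth GFF3 column into a dictionary and reports the Type of each rRNA feature. It is a standalone illustration with hypothetical function names, not part of GFF2MSS.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse a GFF3 attributes column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs


def rrna_types(gff_lines):
    """Map each rRNA feature ID to its Type attribute (None if missing)."""
    types = {}
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "rRNA":
            attrs = parse_attributes(cols[8])
            types[attrs.get("ID")] = attrs.get("Type")
    return types


lines = [
    "chr1\t.\tgene\t2472097\t2473907\t.\t+\t.\tID=rRNA_001;Name=chr1_1",
    "chr1\t.\trRNA\t2472097\t2473907\t.\t+\t.\tParent=rRNA_001;ID=rRNA_0011;Name=chr1_1-18S;Type=18S",
    "chr1\t.\trRNA\t2474014\t2474165\t.\t+\t.\tParent=rRNA_003;ID=rRNA_0031;Name=chr1-1-5.8S;Type=5.8S",
]
print(rrna_types(lines))
```

Any ID mapped to None points to an rRNA line that still needs a Type value.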

tDNA data

Please use the following GFF3 format for tRNA data.

e.g.,
chr1	.	gene	2	74	.	+	.	ID=t91_gene
chr1	.	tRNA	2	74	.	+	.	Parent=t91_gene;ID=t91_tRNA;Name=tRNA-Lys;anticodon=(pos:45..47,aa:Lys)
chr1	.	exon	2	51	.	+	.	Parent=t91_tRNA;ID=t91_exon_1
chr1	.	exon	60	74	.	+	.	Parent=t91_tRNA;ID=t91_exon_2

The "Name" attribute is used as the "product" qualifier, and the "anticodon" attribute is written as the "anticodon" qualifier in the MSS annotation file, as below.

e.g.,
	tRNA	join(2..51,60..74)   	product	tRNA-Lys
			locus_tag	TES_000011100
			anticodon	(pos:45..47,aa:Lys)
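The join(2..51,60..74) location above combines the two exon lines of the tRNA. A minimal sketch of how exon coordinates can be turned into such a location string follows; the function name is hypothetical and the code is not taken from GFF2MSS. The complement(...) wrapper for the minus strand follows the INSDC feature table notation.

```python
def mss_location(exons, strand="+"):
    """Build a location string from (start, end) exon pairs.

    Coordinates are 1-based and inclusive, as in GFF3.
    """
    parts = ["{}..{}".format(start, end) for start, end in sorted(exons)]
    loc = parts[0] if len(parts) == 1 else "join({})".format(",".join(parts))
    return "complement({})".format(loc) if strand == "-" else loc


print(mss_location([(2, 51), (60, 74)]))
print(mss_location([(60, 74), (2, 51)], strand="-"))
```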

./utl/tRNA2gff3.py converts a tab-separated file of the following type, manually modified from the tRNAscan structure prediction, to GFF3.

e.g.,
source	start	end	Type	Anticodon	AntC_start	AntC_end	intron_start	intron_end	Possible_pseudogene
chr1	491917	492013	Ile	TAT	491952	491954	491956	491977	
chr1	917911	917808	Ile	TAT	917876	917874	917872	917845	pseudogene
chr1	917525	917429	Asn	ATT	917490	917488	917486	917463	
chr1	915945	915827	Pro	CGG	915909	915907	915905	915869	pseudogene
chr1	899982	899904	Ile	GAT	899952	899950
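To make the table layout concrete, the sketch below reads one row and derives the strand and exon coordinates. It is an illustration under stated assumptions, not the logic of tRNA2gff3.py: a start larger than the end is taken to mean the minus strand, intron coordinates are taken as inclusive, and rows without intron columns are treated as single-exon tRNAs. The function name and the returned keys are hypothetical.

```python
def trna_row_to_exons(line):
    """Convert one tab-separated row into strand and exon coordinates."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (10 - len(fields))  # pad rows with missing trailing columns
    start, end = int(fields[1]), int(fields[2])
    strand = "+" if start <= end else "-"
    low, high = sorted((start, end))
    if fields[7].strip() and fields[8].strip():
        # Split the tRNA into two exons around the (inclusive) intron.
        i_low, i_high = sorted((int(fields[7]), int(fields[8])))
        exons = [(low, i_low - 1), (i_high + 1, high)]
    else:
        exons = [(low, high)]
    return {
        "seqid": fields[0],
        "strand": strand,
        "exons": exons,
        "product": "tRNA-" + fields[3],
        "pseudo": fields[9].strip() == "pseudogene",
    }


row = "chr1\t917911\t917808\tIle\tTAT\t917876\t917874\t917872\t917845\tpseudogene"
print(trna_row_to_exons(row))
```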

Licence

MIT License

Author

Taro Maeda

Project details

This version: 4.2.1 (May 27, 2025)
