A package to automatically access the inverted repeats of archived plastid genomes
airpg: Automatically accessing the inverted repeats of archived plastid genomes
A Python package for automatically accessing the inverted repeats of thousands of plastid genomes stored on NCBI Nucleotide
INSTALLATION
To get the most recent stable version of airpg, run:
pip install airpg
Alternatively, to get the latest development version of airpg, run:
pip install git+https://github.com/michaelgruenstaeudl/airpg.git
EXAMPLE USAGE
EXAMPLE 1: Very short survey (runtime ca. 5 min.; for the impatient)
Survey of all plastid genomes of flowering plants submitted to NCBI Nucleotide within the past 10 days.
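TODAY=$((10#$(date +%d)))  # force base 10; "08" and "09" would otherwise fail as invalid octal
if ((TODAY >= 6 && TODAY <= 10)); then
STARTDATE=$(date +%Y/%m/01)
elif ((TODAY >= 11 && TODAY <= 15)); then
STARTDATE=$(date +%Y/%m/05)
elif ((TODAY >= 16 && TODAY <= 20)); then
STARTDATE=$(date +%Y/%m/10)
elif ((TODAY >= 21 && TODAY <= 25)); then
STARTDATE=$(date +%Y/%m/15)
elif ((TODAY >= 26)); then
STARTDATE=$(date +%Y/%m/20)
else
# Days 1-5: start on the 20th of the previous month, rolling the year back in January
PREVYEAR=$(date +%Y)
PREVMONTH=$((10#$(date +%m)-1))
if ((PREVMONTH == 0)); then PREVMONTH=12; PREVYEAR=$((PREVYEAR-1)); fi
STARTDATE=$(printf "%d/%02d/20" "$PREVYEAR" "$PREVMONTH")
fi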
ENDDATE=$(date +%Y/%m/%d)
airpg_identify.py \
-q "complete genome[TITLE] AND \
(chloroplast[TITLE] OR plastid[TITLE]) AND \
$STARTDATE:$ENDDATE[PDAT] AND \
50000:250000[SLEN] NOT unverified[TITLE] \
NOT partial[TITLE] AND Magnoliophyta[ORGN]" \
-o output_script1.tsv \
#&> output_script1.log
mkdir -p records
mkdir -p data
airpg_analyze.py \
-i output_script1.tsv \
-m john.smith@example.com \
-o output_script2.tsv \
--recordsdir records/ \
--datadir data/ \
#&> output_script2.log
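The start date in Example 1 is picked from fixed day-of-month buckets, so the window is only roughly 10 days long. As a minimal sketch, an exact rolling window for the [PDAT] filter could be computed as follows. The function name pdat_window is hypothetical and not part of airpg, and the sketch assumes GNU date (Linux), falling back to BSD date (macOS).

```shell
# Hypothetical helper (not part of airpg): print a [PDAT] range that
# covers the last N days, where N defaults to 10.
pdat_window() {
    local days start end
    days="${1:-10}"
    end=$(date +%Y/%m/%d)
    # GNU date understands "-d 'N days ago'"; BSD date uses "-v-Nd" instead.
    if ! start=$(date -d "${days} days ago" +%Y/%m/%d 2>/dev/null); then
        start=$(date -v-"${days}"d +%Y/%m/%d)
    fi
    printf '%s:%s[PDAT]\n' "$start" "$end"
}

# Print the range for the past 10 days.
pdat_window 10
```

The printed range can take the place of the $STARTDATE:$ENDDATE[PDAT] term in the query string passed to airpg_identify.py.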
EXAMPLE 2: Short survey (runtime ca. 15 min.; for testing)
Survey of all plastid genomes of flowering plants submitted to NCBI Nucleotide within the current month.
airpg_identify.py -q "complete genome[TITLE] AND \
(chloroplast[TITLE] OR plastid[TITLE]) AND \
$(date +%Y/%m/01):$(date +%Y/%m/%d)[PDAT] AND \
50000:250000[SLEN] NOT unverified[TITLE] \
NOT partial[TITLE] AND Magnoliophyta[ORGN]" \
-o output_script1.tsv # &> output_script1.log
airpg_analyze.py -i output_script1.tsv \
-m john.smith@example.com -o output_script2.tsv \
# &> output_script2.log
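Before starting the longer second step, it can be useful to check how many records the first step found. The sketch below counts the data rows of a tab-separated file. It assumes that the first line of the file is a header row, which is an assumption about the output layout rather than something stated here; the function name count_tsv_rows is hypothetical.

```shell
# Hypothetical helper (not part of airpg): count the non-empty lines
# that follow the first line of a file. The first line is ASSUMED to
# be a header row.
count_tsv_rows() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example call, to be run after airpg_identify.py has finished:
# count_tsv_rows output_script1.tsv
```

A count of zero usually suggests that the query matched nothing in the chosen date range, in which case running airpg_analyze.py is unlikely to be useful.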
EXAMPLE 3: Medium survey (runtime ca. 5 hours)
Survey of all plastid genomes of flowering plants submitted to NCBI Nucleotide in 2019 only. Note: The results of this survey are available on Zenodo via DOI 10.5281/zenodo.4335906
airpg_update_blocklist.py -f airpg_blocklist.txt \
-m john.smith@example.com -q "inverted[TITLE] AND \
repeat[TITLE] AND loss[TITLE]"
airpg_identify.py -q "complete genome[TITLE] AND \
(chloroplast[TITLE] OR plastid[TITLE]) AND \
2019/01/01:2019/12/31[PDAT] AND 50000:250000[SLEN] \
NOT unverified[TITLE] NOT partial[TITLE] AND \
Magnoliophyta[ORGN]" \
-b airpg_blocklist.txt -o output_script1.tsv
airpg_analyze.py -i output_script1.tsv \
-m john.smith@example.com -o output_script2.tsv
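Examples 1 to 3 use the same Entrez search string and differ only in the date range. As a sketch, the string could be assembled by a small shell function so that the fixed terms are written only once. The function name build_query is hypothetical and not part of airpg; the search terms are copied from the examples above.

```shell
# Hypothetical helper (not part of airpg): assemble the Entrez search
# string used in the examples from a start date, an end date and an
# optional taxon name (default: Magnoliophyta).
build_query() {
    local start end taxon
    start="$1"
    end="$2"
    taxon="${3:-Magnoliophyta}"
    # printf repeats the '%s' format for every argument, so the three
    # pieces are concatenated without separators.
    printf '%s' \
        "complete genome[TITLE] AND (chloroplast[TITLE] OR plastid[TITLE]) AND " \
        "${start}:${end}[PDAT] AND 50000:250000[SLEN] " \
        "NOT unverified[TITLE] NOT partial[TITLE] AND ${taxon}[ORGN]"
    printf '\n'
}

# Print the query for the 2019 survey of Example 3.
build_query 2019/01/01 2019/12/31

# Possible use with the first script:
# airpg_identify.py -q "$(build_query 2019/01/01 2019/12/31)" -o output_script1.tsv
```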
EXAMPLE 4: Full survey (runtime ca. 19 hours; with explanations)
Survey of all plastid genomes of flowering plants submitted to NCBI Nucleotide from the start of 2000 until the end of October 2020. Note: The results of this survey are available on Zenodo via DOI 10.5281/zenodo.4335906.
STEP 1: Querying NCBI Nucleotide for complete plastid genomes given an Entrez search string
TESTFOLDER=./angiosperms_Start2000toEndOct2020
DATE=$(date '+%Y_%m_%d')
ENTREZSTRING='complete genome[TITLE] AND (chloroplast[TITLE] OR plastid[TITLE]) AND 2000/01/01:2020/10/31[PDAT] AND 50000:250000[SLEN] NOT unverified[TITLE] NOT partial[TITLE] AND Magnoliophyta[ORGN]' # complete plastid genomes of all flowering plants between start of 2000 and end of October 2020
RECORDSTABLE=plastome_availability_table_${DATE}.tsv
mkdir -p $TESTFOLDER
# Updating blocklist
if [ ! -f ./airpg_blocklist.txt ]; then
touch ./airpg_blocklist.txt
airpg_update_blocklist.py -f ./airpg_blocklist.txt
fi
airpg_update_blocklist.py -f ./airpg_blocklist.txt -m john.smith@example.com -q "inverted[TITLE] AND repeat[TITLE] AND loss[TITLE]"
airpg_identify.py -q "$ENTREZSTRING" -o $TESTFOLDER/$RECORDSTABLE \
--blocklist ./airpg_blocklist.txt 1>>$TESTFOLDER/airpg_identify_${DATE}.runlog 2>&1
STEP 2: Retrieving and parsing the genome records identified in step 1, analyzing the position and length of their IR annotations
IRSTATSTABLE=reported_IR_stats_table_${DATE}.tsv
mkdir -p $TESTFOLDER/records_${DATE}
mkdir -p $TESTFOLDER/data_${DATE}
airpg_analyze.py -i $TESTFOLDER/$RECORDSTABLE \
-r $TESTFOLDER/records_${DATE}/ -d $TESTFOLDER/data_${DATE}/ \
-m john.smith@example.com -o $TESTFOLDER/$IRSTATSTABLE 1>>$TESTFOLDER/airpg_analyze_${DATE}.runlog 2>&1
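Because both steps redirect all of their output to run logs, a failed run produces no message on the terminal. A minimal sketch of a check that the expected tables exist and are not empty is given below; the function name check_outputs is hypothetical and not part of airpg.

```shell
# Hypothetical helper (not part of airpg): return a non-zero status if
# any of the given files is missing or empty.
check_outputs() {
    local rc f
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example call, to be run after both steps have finished:
# check_outputs "$TESTFOLDER/$RECORDSTABLE" "$TESTFOLDER/$IRSTATSTABLE"
```

If the check fails, the corresponding .runlog file in $TESTFOLDER is the first place to look.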
CHANGELOG
See CHANGELOG.md for a list of recent changes to the software.