A command-line tool to download and curate RefSeq plasmid sequences from NCBI.
refseq-plasmid-dl
refseq-plasmid-dl is a Python command-line tool for downloading, parsing, filtering, and curating plasmid sequences from the NCBI RefSeq database. It offers flexible filtering on metadata and sequence properties, and generates comprehensive reports and curated multi-FASTA files.
This project was heavily inspired by the excellent ncbi-genome-download tool, which is a staple for fetching bacterial assemblies. However, I found it difficult to specifically target and download only plasmid sequences using its existing filters. I built this tool to bridge that gap, providing a dedicated workflow for retrieving plasmid data from RefSeq without the overhead of filtering through full genome assemblies.
Features
- Download Plasmid Sequences: Recursively fetches all gzipped FASTA files (.fna.gz) and GenBank files (.gbff.gz) from the NCBI RefSeq plasmid FTP/HTTPS directories.
- Metadata Extraction: Parses GenBank files to extract organism, taxonomy, plasmid name, strain, host, collection date, and other key metadata.
- Flexible Filtering: Filter sequences by:
  - Organism or Genus
  - NCBI Taxonomy ID
  - Strain, Isolate, Host, Plasmid Name, Geographic Location, Isolation Source
  - Sequence properties (minimum/maximum length, topology: circular/linear)
  - Record date or collection date
- Combining & Output: Generates a single curated multi-FASTA file and a CSV metadata table.
- Reporting: Produces a detailed summary report with counts of total records, filtered records by reason, and sequences written.
- Development Mode: Quick test download of a single file set for rapid testing.
- Reprocessing Existing Data: Skip download and re-run filtering/reporting on already downloaded files.
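The filtering and reporting features above can be pictured with a minimal sketch. It is not the tool's actual implementation: the record dictionaries, their field names (`organism`, `topology`, `length`), the `filter_records` function, and the case-insensitive matching are all assumptions made for illustration. It shows a substring match on organism, topology and length checks, and a per-reason tally similar in spirit to the summary report.

```python
from collections import Counter


def filter_records(records, organism=None, topology="circular",
                   min_length=None, max_length=None):
    """Return (kept_records, rejection_counts). Hypothetical illustration only."""
    kept = []
    reasons = Counter()
    for rec in records:
        # Substring match on the organism name (case-insensitive by assumption).
        if organism and organism.lower() not in rec["organism"].lower():
            reasons["organism"] += 1
        # "all" disables the topology check.
        elif topology != "all" and rec["topology"] != topology:
            reasons["topology"] += 1
        elif min_length is not None and rec["length"] < min_length:
            reasons["min_length"] += 1
        elif max_length is not None and rec["length"] > max_length:
            reasons["max_length"] += 1
        else:
            kept.append(rec)
    return kept, reasons


# Made-up records, not real RefSeq entries.
records = [
    {"id": "rec1", "organism": "Salmonella enterica", "topology": "circular", "length": 94000},
    {"id": "rec2", "organism": "Escherichia coli", "topology": "circular", "length": 5200},
    {"id": "rec3", "organism": "Salmonella enterica", "topology": "linear", "length": 30000},
]

kept, reasons = filter_records(records, organism="Salmonella", topology="circular")
print([rec["id"] for rec in kept])
print(dict(reasons))
```

Each record is counted under the first filter it fails, so the per-reason counts plus the kept records always add up to the total.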
Installation
Prerequisites
- Python 3.9+
- Internet access (needed for downloads and NCBI Entrez queries)
Using PyPI
pip install refseq-plasmid-dl
Using Bioconda
conda install -c bioconda refseq-plasmid-dl
From Source
git clone https://github.com/yourusername/refseq-plasmid-dl.git
cd refseq-plasmid-dl
pip install .
Usage
The main command is:
refseq-plasmid-dl
Basic Download and Report
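Download the RefSeq plasmid files and generate the metadata table and report. Because --topology defaults to circular (see Command-Line Arguments), only circular plasmids are retained unless you pass --topology all: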
refseq-plasmid-dl
Curating Specific Organism or Topology
Download and retain circular plasmids for a specific organism, e.g., Salmonella:
refseq-plasmid-dl --outdir Salmonella_plasmids --organism Salmonella --topology circular
Reprocessing Existing Files
Skip download and apply new filters to an existing directory:
refseq-plasmid-dl --indir existing_data --organism Escherichia --topology circular
Command-Line Arguments
| Argument | Shorthand | Default | Description |
|---|---|---|---|
| --outdir | -o | refseq_plasmids | Output directory for FASTA files, reports, and metadata. |
| --indir | -i | None | Use existing download directory, skip FTP/HTTPS download. |
| --dev-mode | -d | False | Development mode: download a single test file set. |
| --force | | False | Force re-download of existing files. |
| --organism | -s | all | Filter by species/organism (substring match). |
| --taxid | -t | all | Filter by NCBI Taxonomy ID. |
| --strain | | None | Filter by strain. |
| --isolate | | None | Filter by isolate. |
| --host | | None | Filter by host organism. |
| --plasmid-name | | None | Filter by plasmid name. |
| --geo_loc_name | | None | Filter by geographic location. |
| --isolation_source | | None | Filter by isolation source. |
| --min-length | | None | Minimum sequence length (bp). |
| --max-length | | None | Maximum sequence length (bp). |
| --topology | | circular | Filter by topology: circular, linear, or all. |
| --min-date | | None | Include only records updated after YYYY-MM-DD. |
| --max-date | | None | Include only records updated before YYYY-MM-DD. |
| --min-collection-date | | None | Include only records collected after YYYY-MM-DD. |
| --max-collection-date | | None | Include only records collected before YYYY-MM-DD. |
| --version | -v | | Show program version. |
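When the tool is called from a pipeline script, it can help to assemble the argument list programmatically. The sketch below is one possible way to do that; the `FLAGS` mapping and the `build_command` helper are hypothetical and not part of refseq-plasmid-dl. Only flags listed in the table above are used, and only a subset of the value-taking ones is covered (boolean switches such as --force are left out).

```python
import shlex

# Hypothetical mapping from Python keyword names to flags from the table above.
# Note that the table spells some flags with underscores and others with hyphens.
FLAGS = {
    "outdir": "--outdir",
    "indir": "--indir",
    "organism": "--organism",
    "taxid": "--taxid",
    "topology": "--topology",
    "min_length": "--min-length",
    "max_length": "--max-length",
    "geo_loc_name": "--geo_loc_name",
    "isolation_source": "--isolation_source",
}


def build_command(**options):
    """Build an argument list for refseq-plasmid-dl, skipping options set to None."""
    cmd = ["refseq-plasmid-dl"]
    for key, value in options.items():
        if value is None:
            continue
        # An unknown keyword raises KeyError rather than producing a made-up flag.
        cmd += [FLAGS[key], str(value)]
    return cmd


cmd = build_command(indir="existing_data", organism="Escherichia", min_length=5000)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the tool is installed; passing a list rather than a single string avoids shell quoting problems with values that contain spaces.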
Output Directory Structure
[OUTDIR]/
├── plasmids_index.html # Saved HTML listing of NCBI plasmid directory
├── refseq_plasmids/ # Raw downloaded FASTA and GBFF files
│ ├── plasmid.*.genomic.fna.gz
│ └── plasmid.*.genomic.gbff.gz
├── refseq_plasmids_dl.fasta # Final curated multi-FASTA
├── refseq_plasmids_dl_metadata.csv # Full metadata for filtered plasmids
└── refseq_plasmids_dl_report.csv # Summary report with filter counts
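A quick way to sanity-check the curated multi-FASTA is to count its records and their lengths. The sketch below uses only the standard library and relies on the general FASTA convention that a header line starts with `>`; the `fasta_lengths` helper and the example sequences are hypothetical, and taking the first word of the header as the record ID is an assumption.

```python
def fasta_lengths(lines):
    """Map each FASTA record ID (first word of its header) to its sequence length.

    `lines` can be any iterable of text lines, such as an open file handle.
    """
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new record.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence line: add its length to the current record.
            lengths[current] += len(line)
    return lengths


# Made-up example data, not real plasmid sequences.
example = [">seqA example plasmid", "ACGT", "ACG", ">seqB", "TTTT"]
lengths = fasta_lengths(example)
print(len(lengths), "records")
print(lengths)
```

For real output, pass an open handle instead of the list, for example `open("refseq_plasmids/refseq_plasmids_dl.fasta")` when the default --outdir was used. The lengths can then be compared against any --min-length or --max-length values given on the command line.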
Examples
Download all plasmids and generate report
refseq-plasmid-dl --topology all
Filter by organism and circular topology
refseq-plasmid-dl --organism Escherichia --topology circular
Skip download and reprocess existing data
refseq-plasmid-dl --indir refseq_plasmids --min-length 5000 --max-length 200000
Note on AI Assistance
Parts of this codebase and documentation were drafted with the help of Google Gemini. The AI was used primarily for creating the project outline, writing development logic, and generating test cases. The final logic and implementation details have been verified manually.
License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for full details.
File details
Details for the file refseq_plasmid_dl-0.1.0.tar.gz.
File metadata
- Download URL: refseq_plasmid_dl-0.1.0.tar.gz
- Upload date:
- Size: 53.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 190db2ec8e76f20466b022ab08e49299c5c83ee6af28c7ea95b53cfb1f3e4679 |
| MD5 | b504f6f64a16ddb74ac285d478f54a73 |
| BLAKE2b-256 | 4fbd26f04cefe9b618ac62b4d9b6bba6655ac334b7223efd1738f8f1eac9cd26 |
File details
Details for the file refseq_plasmid_dl-0.1.0-py3-none-any.whl.
File metadata
- Download URL: refseq_plasmid_dl-0.1.0-py3-none-any.whl
- Upload date:
- Size: 37.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4731227665f5b864c4623afd1f8536501355ac77b16be8ce7da7a44e4528a9a4 |
| MD5 | 4042021e2d441ae0836210904caf0c4a |
| BLAKE2b-256 | 5c955d74469b4a52413409633927d519fa371e36c465bcb9c5206741e8bd1ca1 |