Comprehensive tool for downloading and managing ENA sequencing data
ENATool 🧬
A comprehensive Python package for downloading and managing sequencing data from the European Nucleotide Archive (ENA), usable from the terminal or through a Python interface.
✨ Features
- 📊 Extract Metadata - Get comprehensive sample information from ENA projects
- 📥 Download FASTQ Files - Automated download with progress tracking
- 🔄 Auto Fallback - Automatically tries NCBI if ENA metadata is unavailable
- 📈 Progress Bars - Real-time progress for downloads and metadata retrieval
- 📋 Interactive Reports - Generate searchable HTML tables with DataTables.js
- 💾 Export to CSV - Save metadata in standard formats
- 🔍 Smart Verification - Check FASTQ file integrity and skip existing files
- 💻 Command line and Python interface
🚀 Quick Start
Installation
# Install from PyPI
pip install ENATool
Basic Usage in Terminal
# Custom output directory
enatool download PRJNA335681 --path data/my_project
Basic Usage in Python
import ENATool
# Fetch metadata AND download files in one command
info, downloads = ENATool.fetch('PRJNA335681', path='data/my_project', download=True)
📊 Example Output Files
ENATool creates organized output:
my_project/
├── PRJNA335681.csv # Sample metadata
├── PRJNA335681.html # Interactive table
├── downoad_info_table.csv # Download tracking
└── raw_reads/ # Downloaded FASTQ files
    ├── SRR123456/
    │   ├── SRR123456_1.fastq.gz
    │   └── SRR123456_2.fastq.gz
    └── SRR123457/
        └── SRR123457.fastq.gz
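The layout above can be inspected with a few lines of Python. The following is a minimal sketch, not part of ENATool: `classify_runs` is a hypothetical helper that assumes the file naming shown in the tree (`RUN_1.fastq.gz` and `RUN_2.fastq.gz` for paired-end runs, `RUN.fastq.gz` for single-end runs). The demo builds a throwaway directory with empty files so it can run without any downloads.

```python
import tempfile
from pathlib import Path


def classify_runs(raw_reads_dir):
    """Map each run directory to 'PAIRED', 'SINGLE', or 'UNKNOWN' by file names.

    Hypothetical helper: relies only on the naming shown in the tree above.
    """
    layout = {}
    for run_dir in sorted(Path(raw_reads_dir).iterdir()):
        if not run_dir.is_dir():
            continue
        run = run_dir.name
        names = {f.name for f in run_dir.glob("*.fastq.gz")}
        if {f"{run}_1.fastq.gz", f"{run}_2.fastq.gz"} <= names:
            layout[run] = "PAIRED"
        elif f"{run}.fastq.gz" in names:
            layout[run] = "SINGLE"
        else:
            layout[run] = "UNKNOWN"
    return layout


# Demo on a temporary copy of the example tree (empty placeholder files).
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_reads"
    for rel in ("SRR123456/SRR123456_1.fastq.gz",
                "SRR123456/SRR123456_2.fastq.gz",
                "SRR123457/SRR123457.fastq.gz"):
        target = raw / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    demo_layout = classify_runs(raw)

print(demo_layout)
```

For a real project, call `classify_runs("my_project/raw_reads")` instead of the temporary directory.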
🔧 Requirements
- Python >= 3.7
- pandas >= 1.3.0
- numpy >= 1.20.0
- requests >= 2.25.0
- xmltodict >= 0.12.0
- tqdm >= 4.60.0
- lxml >= 4.6.0
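To see which of these dependencies are present in the current environment, a short check like the one below may help. It is a sketch that uses only the standard library (`importlib.metadata`, available from Python 3.8) and reports installed versions; it does not compare them against the minimum versions listed above. `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata  # standard library since Python 3.8

# Package names taken from the requirements list above.
REQUIRED = ["pandas", "numpy", "requests", "xmltodict", "tqdm", "lxml"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```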
📖 Documentation
Use ENATool in Terminal
Fetching Metadata
Download metadata for all samples in an ENA project using enatool fetch.
Syntax:
enatool fetch PROJECT_ID [--path DIR]
Arguments:
- PROJECT_ID (required): ENA project accession (e.g., PRJNA335681)
- --path DIR or -p DIR: Output directory (default: PROJECT_ID)
What it does:
- Downloads sample metadata from ENA
- Tries NCBI BioSample as fallback if ENA fails
- Creates CSV file with all metadata
- Generates interactive HTML report
- Shows progress bars
Output files:
- PROJECT_ID.csv - Metadata in CSV format
- PROJECT_ID.html - Interactive HTML table
Examples:
# Basic usage - saves to PRJNA335681/
enatool fetch PRJNA335681
# Custom output directory
enatool fetch PRJNA335681 --path data/my_project
Download Reads and Fetch Metadata
Fetch metadata for all samples in an ENA project and download the sample files using enatool download.
Syntax:
enatool download PROJECT_ID [--path DIR]
Arguments:
- PROJECT_ID (required): ENA project accession
- --path DIR or -p DIR: Output directory (default: PROJECT_ID)
What it does:
- Downloads metadata (same as fetch)
- Downloads all FASTQ files for all samples
- Uses enaDataGet tool
- Skips files that already exist
- Tracks download status
Output files:
- PROJECT_ID.csv - Metadata
- PROJECT_ID.html - Interactive table
- downoad_info_table.csv - Download tracking
- raw_reads/ - Directory with FASTQ files
  - SRR123456/ - One directory per run
    - SRR123456_1.fastq.gz - Forward reads
    - SRR123456_2.fastq.gz - Reverse reads (if paired-end)
Examples:
# Download everything
enatool download PRJNA335681
# Custom output directory
enatool download PRJNA335681 --path data/project1
Show Project Summary [stdout]
Display summary information about a downloaded project using enatool info.
Syntax:
enatool info PROJECT_ID --path DIR
Arguments:
- PROJECT_ID (required): ENA project accession
- --path DIR or -p DIR (required): Directory containing metadata
What it does:
- Reads metadata from CSV file
- Shows summary statistics
- Displays organism breakdown
- Shows sequencing platforms
- Shows download status (if available)
Examples:
# Show info for custom directory
enatool info PRJNA335681 --path data/my_project
Output:
📊 Project Information: PRJNAXXXXXX
============================================================
Total samples: 50
Organisms (2):
• Homo sapiens: 45
• Mus musculus: 5
Sequencing Platforms:
• ILLUMINA: 50
Library Strategies:
• RNA-Seq: 30
• WGS: 15
• ChIP-Seq: 5
Library Layout:
• PAIRED: 45
• SINGLE: 5
Download Status:
• OK: 48
• Error: 2
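A comparable summary can be computed directly from the metadata CSV. The sketch below is not how enatool info is implemented; it only counts values per column with the standard library. The column names `scientific_name` and `instrument_platform` appear elsewhere in this README, while `library_strategy`, `library_layout`, and `run_accession` are assumed names, and the rows are made up.

```python
import csv
import io
from collections import Counter

# 'scientific_name' and 'instrument_platform' are used in this README;
# the other two column names are assumptions.
SUMMARY_COLUMNS = ["scientific_name", "instrument_platform",
                   "library_strategy", "library_layout"]


def summarize(csv_file, columns=SUMMARY_COLUMNS):
    """Count values per column; columns absent from the file are skipped."""
    rows = list(csv.DictReader(csv_file))
    counts = {c: Counter(r[c] for r in rows)
              for c in columns if rows and c in rows[0]}
    return len(rows), counts


# Tiny in-memory stand-in for PROJECT_ID.csv (made-up rows).
sample_csv = io.StringIO(
    "run_accession,scientific_name,instrument_platform\n"
    "RUN_A,Homo sapiens,ILLUMINA\n"
    "RUN_B,Homo sapiens,ILLUMINA\n"
    "RUN_C,Mus musculus,ILLUMINA\n"
)
total, counts = summarize(sample_csv)
print(f"Total samples: {total}")
for column, counter in counts.items():
    print(column, dict(counter))
```

With a real project, pass an open file instead, for example `summarize(open("data/my_project/PRJNA335681.csv", newline=""))`.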
Redownload Corrupted Files or Download Only Selected Files
Download FASTQ files using previously fetched metadata, or a manually filtered subset of the metadata table, with enatool download-files. This command also forces a re-download of files whose previous download ended with an error.
Syntax:
enatool download-files PROJECT_ID --path DIR
Arguments:
- PROJECT_ID (required): ENA project accession
- --path DIR or -p DIR (required): Directory containing metadata
What it does:
- Loads sample names from the existing CSV file (PROJECT_ID.csv)
- Downloads FASTQ files
- Useful if you already have metadata and just want the files, or if you work with a filtered metadata table
Use cases:
- You fetched metadata earlier with enatool fetch
- You filtered the CSV file manually
- You want to re-download after failures
Examples:
# First get metadata (fast)
enatool fetch PRJNA335681 --path my_project
# Later, download files
enatool download-files PRJNA335681 --path my_project
# Or after filtering CSV file
enatool download-files PRJNA335681 --path my_project
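The manual filtering step between fetch and download-files could look like the sketch below. It is an illustration under assumptions, not ENATool code: `filter_metadata_csv` is a hypothetical helper, the `scientific_name` column is taken from the Python examples in this README, `run_accession` and the rows are made up, and whether download-files needs any columns beyond those kept here is not confirmed by the source. The helper keeps all columns and only drops rows; the demo keeps a backup copy before overwriting the table.

```python
import csv
import tempfile
from pathlib import Path


def filter_metadata_csv(src, dst, column, value):
    """Copy rows where row[column] == value from src to dst; return rows kept."""
    with open(src, newline="") as fin:
        reader = csv.DictReader(fin)
        fieldnames = reader.fieldnames
        kept = [row for row in reader if row[column] == value]
    with open(dst, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)


# Demo on a throwaway table with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "PRJNA335681.csv"
    table.write_text(
        "run_accession,scientific_name\n"
        "RUN_A,Homo sapiens\n"
        "RUN_B,Mus musculus\n"
    )
    # Keep an untouched copy, then overwrite the table with the filtered rows.
    backup = table.with_name(table.name + ".bak")
    backup.write_text(table.read_text())
    n_kept = filter_metadata_csv(backup, table, "scientific_name", "Homo sapiens")
    filtered_text = table.read_text()

print(n_kept, filtered_text.splitlines())
```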
Leave files with incorrect md5 checksum
By default, ENATool removes every file that is corrupted or whose md5 checksum does not match. Use the --keep-failed parameter to prevent the removal.
Syntax:
# with download command
enatool download PROJECT_ID --path DIR --keep-failed
# with download-files command
enatool download-files PROJECT_ID --path DIR --keep-failed
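Files kept with --keep-failed can be checked by hand against the checksums in the metadata. The following is a generic sketch of an md5 comparison with the standard library; it is not ENATool's internal verification code, and `md5_of_file` and `checksum_matches` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare the file's MD5 with an expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demo on a small temporary file instead of a real FASTQ download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.fastq.gz"
    payload = b"placeholder bytes, not real sequencing data"
    sample.write_bytes(payload)
    expected = hashlib.md5(payload).hexdigest()
    demo_ok = checksum_matches(sample, expected)
    demo_bad = checksum_matches(sample, "0" * 32)

print(demo_ok, demo_bad)
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte FASTQ files.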
Process multiple projects
For processing multiple projects:
# Simple loop
for project in PRJNA335681 PRJNA123456 PRJNA789012; do
    echo "Processing $project..."
    enatool fetch "$project" --path "data/$project"
done
# Or with download
for project in PRJNA335681 PRJNA123456; do
    echo "Downloading $project..."
    enatool download "$project" --path "data/$project"
done
Hide banner
Use the global enatool option --no-banner. It must come right after enatool and before the action command.
Example:
enatool --no-banner fetch PRJNA335681
Disable progress bar
Use the global enatool option --no-progress-bar. It must come right after enatool and before the action command.
Example:
enatool --no-progress-bar fetch PRJNA335681
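When enatool is called from a script, the ordering rule for global options is easy to get wrong. The sketch below builds the argument list so that global options always precede the action. `build_enatool_command` is a hypothetical helper, not part of ENATool; it uses only flags documented above, and passing both global options at once (and their relative order) is an assumption. Note that --keep-failed is documented only for download and download-files.

```python
import shlex


def build_enatool_command(action, project_id, path=None, keep_failed=False,
                          no_banner=False, no_progress_bar=False):
    """Build an argument list with global options placed before the action."""
    cmd = ["enatool"]
    # Global options go right after 'enatool' and before the action.
    if no_banner:
        cmd.append("--no-banner")
    if no_progress_bar:
        cmd.append("--no-progress-bar")
    cmd += [action, project_id]
    # Action-specific options follow the project accession.
    if path is not None:
        cmd += ["--path", str(path)]
    if keep_failed:
        cmd.append("--keep-failed")
    return cmd


cmd = build_enatool_command("download", "PRJNA335681", path="data/my_project",
                            keep_failed=True, no_banner=True)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```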
Use ENATool in Python
Fetch Metadata
Use the fetch() function to download metadata:
import ENATool
# Basic usage - just get metadata
info_table = ENATool.fetch('PRJNA335681')
# Specify custom directory
info_table = ENATool.fetch('PRJNA335681', path='data/my_project')
# Get metadata AND download files
info_table, downloads = ENATool.fetch('PRJNA335681', download=True)
# Show some basic stats
print(f"Total samples: {len(info_table)}")
print(f"Organisms: {info_table['scientific_name'].unique()}")
print(f"Platforms: {info_table['instrument_platform'].value_counts()}")
What you get:
- Sample accessions and metadata
- Run accessions and sequencing details
- FASTQ file URLs and checksums
- Organism and experimental information
- Interactive HTML report
Download FASTQ Files
import ENATool
# Get metadata AND download files
info_table, downloads = ENATool.fetch('PRJNA335681', download=True)
# Check results
print(downloads['download_status'].value_counts())
Download status values:
- OK - Successfully downloaded
- Exists - File already exists (skipped)
- Error - Download failed
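These status values make it easy to list the runs that need another attempt. The sketch below works on plain dictionaries so it runs without pandas or network access; with a real download table the records could be obtained via `downloads.to_dict("records")`. `runs_needing_retry` is a hypothetical helper, the `download_status` column comes from the example above, and `run_accession` is an assumed column name.

```python
from collections import Counter


def runs_needing_retry(records, status_key="download_status",
                       run_key="run_accession"):
    """Return status counts and the runs whose status is 'Error'."""
    counts = Counter(rec[status_key] for rec in records)
    failed = [rec[run_key] for rec in records if rec[status_key] == "Error"]
    return counts, failed


# Made-up records standing in for rows of the download table.
records = [
    {"run_accession": "RUN_A", "download_status": "OK"},
    {"run_accession": "RUN_B", "download_status": "Exists"},
    {"run_accession": "RUN_C", "download_status": "Error"},
]
status_counts, failed_runs = runs_needing_retry(records)
print(dict(status_counts), failed_runs)
```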
Download only a subset of samples
import ENATool
# Get metadata
info = ENATool.fetch('PRJNA335681')
# Filter samples
human_samples = info[info['scientific_name'] == 'Homo sapiens']
# ! Important !
# Re-initialize for filtered table
human_samples.ena.reinit(info)
# Download only filtered samples
downloads = human_samples.ena.download()
# Save to CSV
human_samples.to_csv('human_samples.csv', index=False)
Leave files with incorrect md5 checksum
Prevent ENATool from automatically removing corrupted files.
import ENATool
# Can be passed to fetch()
info_table, downloads = ENATool.fetch('PRJNA335681', download=True, keep_failed=True)
# Can be passed to download()
info = ENATool.fetch('PRJNA335681')
downloads = info.ena.download(keep_failed=True)
Disable progress bar
import ENATool
# Can be passed to fetch()
info_table, downloads = ENATool.fetch('PRJNA335681', download=True, NO_PROGRESS_BAR=True)
# Can be passed to download()
info = ENATool.fetch('PRJNA335681')
downloads = info.ena.download(NO_PROGRESS_BAR=True)
Work with multiple datasets
import ENATool
projects = ['PRJNA335681', 'PRJEB2961', 'PRJEB28350']
for project_id in projects:
    try:
        info = ENATool.fetch(project_id, path=f'data/{project_id}')
        print(f"✓ {project_id}: {len(info)} samples")
    except Exception as e:
        print(f"✗ {project_id}: {e}")
Python API Reference
ENATool.fetch(project_id, path=None, download=False)
Main entry point for fetching ENA data.
Parameters:
- project_id (str): ENA project accession (e.g., 'PRJNA335681')
- path (str, optional): Directory for outputs (defaults to project_id)
- download (bool, optional): Auto-download FASTQ files (default: False)
Returns:
- DataFrame (if download=False)
- Tuple of (info_table, download_table) (if download=True)
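Because the return type depends on the download flag, code that accepts either form may want to normalise it first. The sketch below shows one way to do that; `unpack_fetch_result` is a hypothetical helper, not part of ENATool, and the demo uses placeholder strings in place of real DataFrames.

```python
def unpack_fetch_result(result):
    """Normalise fetch() output to (info_table, download_table or None)."""
    if isinstance(result, tuple):
        # download=True: fetch() returned (info_table, download_table)
        info_table, download_table = result
        return info_table, download_table
    # download=False: fetch() returned only the metadata table
    return result, None


# Placeholder values stand in for DataFrames so the demo runs offline.
only_info = unpack_fetch_result("info-table")
info_and_downloads = unpack_fetch_result(("info-table", "download-table"))
print(only_info, info_and_downloads)
```

With the real package this would be used as `info, downloads = unpack_fetch_result(ENATool.fetch(project_id, download=flag))`.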
DataFrame.ena.download()
Download FASTQ files for samples in DataFrame.
Returns:
- DataFrame with download status
📝 Citation
If you use ENATool in your research, please cite:
Tikhonova, P. (2021). ENATool: European Nucleotide Archive Data Manager
(v2.0.0). Zenodo. https://doi.org/10.5281/zenodo.17443004
📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
🔗 Links
- PyPI: https://pypi.org/project/ENATool/
- GitHub: https://github.com/PollyTikhonova/ENATool
- Documentation: https://github.com/PollyTikhonova/ENATool#readme
- Bug Reports: https://github.com/PollyTikhonova/ENATool/issues