ENCODEfetch: a command-line tool for retrieving matched case-control data and standardized metadata from ENCODE

ENCODEfetch

ENCODEfetch is a command-line tool and Python package for retrieving matched case–control datasets and standardized metadata from the ENCODE Project.

ENCODEfetch automates:

  • Searching ENCODE experiments by assay, target, organism, status, and more.
  • Matching case experiments to their control experiments.
  • File retrieval (FASTQ, BAM, BED, bigWig, etc.) with filtering by status and assembly.
  • Parallel file downloads with resumable transfers and interactive progress bars.
  • Standardized metadata outputs (manifest.tsv, metadata.jsonl).
  • Plug-and-play samplesheets for nf-core and Snakemake workflows, for reproducible analysis.
  • An interactive API that returns tidy pandas.DataFrame objects of metadata, with file paths, for downstream analysis.

🚀 Installation

From PyPI (recommended)

pip install encodefetch

From source

git clone https://github.com/khan-lab/ENCODEfetch.git
cd ENCODEfetch
pip install -e .

Requires Python 3.9 or newer.

🔧 Command-line usage

encodefetch --assay-title "TF ChIP-seq" \
             --target-label BRD4,SMAD3 \
             --organism "Homo sapiens" \
             --file-type fastq \
             --status released \
             --progress \
             --download \
             --threads 8 \
             --nfcore
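
For orientation, filters like the ones above correspond to query parameters on the public ENCODE search endpoint. The sketch below builds such a URL with the standard library. It is an illustration of how the ENCODE portal is commonly queried, not a description of the requests ENCODEfetch itself sends; the helper name build_search_url is hypothetical.

```python
from urllib.parse import urlencode

ENCODE_SEARCH = "https://www.encodeproject.org/search/"


def build_search_url(assay_title, target_labels, organism, status="released"):
    """Build an ENCODE search URL for experiments matching the given filters.

    Hypothetical helper for illustration; not part of the encodefetch API.
    """
    params = [
        ("type", "Experiment"),
        ("assay_title", assay_title),
        ("status", status),
        ("replicates.library.biosample.donor.organism.scientific_name", organism),
        ("format", "json"),
        ("limit", "all"),
    ]
    # One target.label entry per requested target.
    params += [("target.label", label) for label in target_labels]
    return f"{ENCODE_SEARCH}?{urlencode(params)}"


url = build_search_url("TF ChIP-seq", ["BRD4", "SMAD3"], "Homo sapiens")
print(url)
```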

Key options

  • --accessions ENCSR123ABC,ENCSR456DEF — fetch experiments by accession directly.
  • --accessions accessions.txt — read one experiment accession per line; blank lines and # comments are ignored.
  • --assay-title — e.g. TF ChIP-seq, Histone ChIP-seq, ATAC-seq, RNA-seq.
  • --target-label — one or more targets (comma-separated).
  • --organism — e.g. Homo sapiens, Mus musculus.
  • --file-type — restrict formats (fastq, bam, bed, bigWig…).
  • --status — default released (can also include archived).
  • --perturbed true|false — filter perturbed experiments.
  • --download — actually download matched files.
  • --threads — number of worker threads for metadata fetching, control fetching, and downloads.
  • --max-retries / --chunk-size — tune download retry count and streamed chunk size.
  • --nfcore / --snakemake — export pipeline-ready sample sheets.
  • --control-strategy all|pool|best — choose how samplesheets represent multiple controls.
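
The two forms of --accessions described above can be pictured with a short sketch. It assumes that a value naming an existing file is read line by line and that only whole-line # comments are skipped; the function name parse_accessions is hypothetical and this is not the tool's actual parser.

```python
from pathlib import Path


def parse_accessions(value):
    """Return accessions from a comma-separated string or a text file.

    Hypothetical helper for illustration; not part of the encodefetch API.
    """
    path = Path(value)
    if path.is_file():
        lines = (line.strip() for line in path.read_text().splitlines())
        # Skip blank lines and lines that start with '#'.
        return [line for line in lines if line and not line.startswith("#")]
    # Otherwise treat the value as a comma-separated list.
    return [token.strip() for token in value.split(",") if token.strip()]
```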

Run encodefetch --help to see all options.
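
To make the role of --max-retries and --chunk-size concrete, here is a minimal sketch of a resumable, chunked download using only the standard library. It assumes the server supports HTTP Range requests and uses a simple exponential back-off; the function names and defaults are hypothetical, and ENCODEfetch's own downloader may work differently.

```python
import os
import time
import urllib.request


def resume_offset(dest):
    """Number of bytes already on disk for a partial download (0 if none)."""
    return os.path.getsize(dest) if os.path.exists(dest) else 0


def download(url, dest, chunk_size=1 << 20, max_retries=3):
    """Stream url to dest in chunks, resuming from a partial file on retry."""
    for attempt in range(max_retries + 1):
        offset = resume_offset(dest)
        request = urllib.request.Request(url)
        if offset:
            # Ask the server for the remaining bytes only.
            request.add_header("Range", f"bytes={offset}-")
        try:
            with urllib.request.urlopen(request, timeout=60) as response:
                # 206 means the Range header was honoured, so append;
                # any other status sends the full body, so start over.
                partial = getattr(response, "status", None) == 206
                with open(dest, "ab" if partial else "wb") as fh:
                    while True:
                        chunk = response.read(chunk_size)
                        if not chunk:
                            break
                        fh.write(chunk)
            return dest
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            if getattr(exc, "code", None) == 416:
                # Requested range starts at end of file: nothing left to fetch.
                return dest
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)
```

A smaller chunk size lowers memory use per worker thread, while a larger one reduces the number of read and write calls.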

📦 Outputs

After a run, outdir/ contains:

  • manifest.tsv — tidy table of case/control files with metadata.
  • metadata.jsonl — raw record dump (one JSON per line).
  • files/ — downloaded files, organized by experiment/control.
  • nfcore_*_samplesheet.csv — optional nf-core samplesheet.
  • snakemake_samples.tsv — optional Snakemake sample table.
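
Both metadata outputs can be read back with the standard library. The sketch below assumes only what is stated above: manifest.tsv is tab-separated with a header row, and metadata.jsonl holds one JSON object per line. The helper names are hypothetical, and no particular column names are assumed.

```python
import csv
import json
from pathlib import Path


def load_manifest(outdir):
    """Read manifest.tsv into a list of dicts keyed by the header row."""
    with open(Path(outdir) / "manifest.tsv", newline="") as fh:
        return list(csv.DictReader(fh, delimiter="\t"))


def load_metadata(outdir):
    """Read metadata.jsonl into a list of parsed JSON records."""
    with open(Path(outdir) / "metadata.jsonl") as fh:
        # Ignore empty lines, such as a trailing newline at end of file.
        return [json.loads(line) for line in fh if line.strip()]
```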

ENCODEfetch preserves all experiment-level controls in matched_control_experiments. File-level controlled_by links are normalized into controlled_by_files when ENCODE provides them. Samplesheets prefer file-level control mappings, then fall back to experiment-level controls; --control-strategy all duplicates case rows per control, pool joins controls with semicolons, and best currently chooses the first control deterministically.
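
The three --control-strategy values can be illustrated on a single case row. This is a sketch of the behaviour described above, not the library's implementation: the function name, the control column name, and the choice of "first" as the first item in the order given are all assumptions.

```python
def apply_control_strategy(case_row, controls, strategy="all"):
    """Return samplesheet rows for one case row and its list of controls.

    Hypothetical helper for illustration; not part of the encodefetch API.
    """
    if not controls:
        return [{**case_row, "control": ""}]
    if strategy == "all":
        # One copy of the case row per control.
        return [{**case_row, "control": control} for control in controls]
    if strategy == "pool":
        # A single row with all controls joined by semicolons.
        return [{**case_row, "control": ";".join(controls)}]
    if strategy == "best":
        # A single row with the first control in the given order.
        return [{**case_row, "control": controls[0]}]
    raise ValueError(f"unknown control strategy: {strategy!r}")


case = {"sample": "case_rep1"}
controls = ["CTRL_A", "CTRL_B"]  # placeholder identifiers
for strategy in ("all", "pool", "best"):
    print(strategy, apply_control_strategy(case, controls, strategy))
```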

🐍 Python API

import encodefetch as ef

# Search experiments
metadata, recs = ef.search_experiments(
    assay_title="TF ChIP-seq",
    target_labels=["BRD4","SMAD3"],
    organism="Homo sapiens",
    file_types={"fastq"},
    status="released",
    progress=False,
    threads=2,
)

metadata.head()

# Collapse paired-end FASTQs to one row
metadata_collapse = ef.collapse_fastq_pairs(metadata)

# Write nf-core samplesheet
ef.write_nfcore_sheet(metadata_collapse, "nfcore_chipseq.csv")
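
To show what collapsing paired-end FASTQs to one row means, the sketch below works on plain dictionaries rather than a DataFrame. It is not the implementation of ef.collapse_fastq_pairs: the field names experiment, replicate, and path are hypothetical, paired_end is assumed to hold "1" or "2", and one read pair per replicate is assumed.

```python
from collections import defaultdict


def collapse_pairs(records):
    """Merge read 1 and read 2 records of a replicate into a single row.

    Hypothetical helper for illustration; not part of the encodefetch API.
    """
    grouped = defaultdict(dict)
    for rec in records:
        key = (rec["experiment"], rec["replicate"])
        # Files without paired_end == "2" (including single-end) count as read 1.
        mate = "fastq_2" if str(rec.get("paired_end")) == "2" else "fastq_1"
        grouped[key][mate] = rec["path"]
    return [
        {
            "experiment": experiment,
            "replicate": replicate,
            "fastq_1": files.get("fastq_1", ""),
            "fastq_2": files.get("fastq_2", ""),
        }
        for (experiment, replicate), files in grouped.items()
    ]
```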

🧬 Assay support

ENCODEfetch currently provides assay-aware normalization and nf-core and Snakemake samplesheet exporters for:

  • ChIP-seq (production)
  • ATAC-seq (in beta)
  • RNA-seq (in beta)
  • More assays to be added.

Each assay can plug in its own normalization (e.g., FASTQ collapsing, strandedness detection) and samplesheet exporters.

🤝 Contributing

Contributions are welcome!

  • Add new assay classes under encodefetch/assays/.
  • Add new exporters under encodefetch/exporters/.
  • Extend metadata fields in build_file_record.

See CONTRIBUTING.md for details.
