
ENCODEfetch: a command-line tool for retrieving matched case-control data and standardized metadata from ENCODE


ENCODEfetch


ENCODEfetch is a command-line tool and Python package for retrieving matched case–control datasets and standardized metadata from the ENCODE Project.

ENCODEfetch automates:

  • Searching ENCODE experiments by assay, target, organism, status, and more.
  • Matching case experiments with their control experiments.
  • Retrieving files (FASTQ, BAM, BED, bigWig, etc.) with filtering by status and assembly.
  • Parallel file downloads with resumable transfers and interactive progress bars.
  • Standardized metadata outputs (manifest.tsv, metadata.jsonl).
  • Plug-and-play samplesheets for nf-core and Snakemake workflows, for reproducible analysis.
  • An interactive API that returns tidy pandas.DataFrame objects of metadata with file paths for downstream analysis.


🚀 Installation

From PyPI (recommended)

pip install encodefetch

From source

git clone https://github.com/khan-lab/ENCODEfetch.git
cd ENCODEfetch
pip install -e .

Requires Python 3.9 or newer.

🔧 Command-line usage

encodefetch --assay-title "TF ChIP-seq" \
             --target-label BRD4,SMAD3 \
             --organism "Homo sapiens" \
             --file-type fastq \
             --status released \
             --progress \
             --download \
             --threads 8 \
             --nfcore

Key options

  • --accessions ENCSR123ABC,ENCSR456DEF — fetch experiments by accession directly.
  • --accessions accessions.txt — read one experiment accession per line; blank lines and # comments are ignored.
  • --assay-title — e.g. TF ChIP-seq, Histone ChIP-seq, ATAC-seq, RNA-seq.
  • --target-label — one or more targets (comma-separated).
  • --organism — e.g. Homo sapiens, Mus musculus.
  • --file-type — restrict formats (fastq, bam, bed, bigWig…).
  • --status — default released (can also include archived).
  • --perturbed true|false — filter perturbed experiments.
  • --download — actually download matched files.
  • --threads — number of worker threads for metadata fetching, control fetching, and downloads.
  • --max-retries / --chunk-size — tune download retry count and streamed chunk size.
  • --nfcore / --snakemake — export pipeline-ready sample sheets.
  • --control-strategy all|pool|best — choose how samplesheets represent multiple controls.

Run encodefetch --help to see all options.
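The two forms of --accessions described above (a comma-separated list, or a file with one accession per line) can be sketched as follows. This is only an illustration of the documented behaviour; parse_accessions is a hypothetical helper name and is not part of the ENCODEfetch API.

```python
import os


def parse_accessions(value):
    """Hypothetical helper: turn an --accessions value into a list of accessions.

    Assumption: if the value names an existing file, it is read line by line;
    otherwise it is treated as a comma-separated list.
    """
    if os.path.isfile(value):
        accessions = []
        with open(value, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                # Blank lines and "#" comment lines are ignored.
                if not line or line.startswith("#"):
                    continue
                accessions.append(line)
        return accessions
    # Not a file: split on commas and drop empty tokens.
    return [token.strip() for token in value.split(",") if token.strip()]
```

For example, parse_accessions("ENCSR123ABC,ENCSR456DEF") returns both accessions in the order given.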

📦 Outputs

After a run, outdir/ contains:

  • manifest.tsv — tidy table of case/control files with metadata.
  • metadata.jsonl — raw record dump (one JSON per line).
  • files/ — downloaded files, organized by experiment/control.
  • nfcore_*_samplesheet.csv — optional nf-core samplesheet.
  • snakemake_samples.tsv — optional Snakemake sample table.
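Both metadata outputs can be read back with the Python standard library. The sketch below assumes only what is stated above: manifest.tsv is tab-separated with a header row, and metadata.jsonl holds one JSON object per line. The function names are illustrative, and the column names depend on your run.

```python
import csv
import json


def read_manifest(path):
    """Read manifest.tsv into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def read_metadata_jsonl(path):
    """Read metadata.jsonl: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records
```

If pandas is installed, pandas.read_csv(path, sep="\t") is an equivalent way to load the manifest as a DataFrame.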

ENCODEfetch preserves all experiment-level controls in matched_control_experiments. File-level controlled_by links are normalized into controlled_by_files when ENCODE provides them. Samplesheets prefer file-level control mappings, then fall back to experiment-level controls; --control-strategy all duplicates case rows per control, pool joins controls with semicolons, and best currently chooses the first control deterministically.
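To make the three --control-strategy values concrete, here is a minimal sketch of how one case row could be expanded for each of them. It is not ENCODEfetch's implementation: expand_controls and the "control" column name are assumptions made for illustration, and the controls are assumed to arrive in a stable order.

```python
def expand_controls(case_row, controls, strategy="all"):
    """Hypothetical sketch of the all / pool / best control strategies.

    case_row: dict describing one case sample.
    controls: list of control identifiers, in a stable order.
    Returns a list of samplesheet rows (dicts).
    """
    if not controls:
        # No control available: keep the case row with an empty control field.
        return [{**case_row, "control": ""}]
    if strategy == "all":
        # One copy of the case row per control.
        return [{**case_row, "control": control} for control in controls]
    if strategy == "pool":
        # A single row whose control field joins all controls with semicolons.
        return [{**case_row, "control": ";".join(controls)}]
    if strategy == "best":
        # Currently just the first control, so the choice is deterministic.
        return [{**case_row, "control": controls[0]}]
    raise ValueError(f"unknown control strategy: {strategy!r}")
```

With two controls, all yields two rows, while pool and best each yield one.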

🐍 Python API

import encodefetch as ef

# Search experiments
metadata, recs = ef.search_experiments(
    assay_title="TF ChIP-seq",
    target_labels=["BRD4","SMAD3"],
    organism="Homo sapiens",
    file_types={"fastq"},
    status="released",
    progress=False,
    threads=2,
)

metadata.head()

# Collapse paired-end FASTQs to one row
metadata_collapse = ef.collapse_fastq_pairs(metadata)

# Write nf-core samplesheet
ef.write_nfcore_sheet(metadata_collapse, "nfcore_chipseq.csv")
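For intuition about what collapsing paired-end FASTQs to one row means, the idea can be sketched on plain dictionaries. This is not the code behind ef.collapse_fastq_pairs, which works on the metadata DataFrame; the keys sample, paired_end, path, fastq_1 and fastq_2 are assumptions chosen for the example.

```python
def collapse_pairs(rows):
    """Hypothetical sketch: merge read 1 / read 2 rows into one row per sample.

    rows: iterable of dicts with the assumed keys "sample", "paired_end"
    ("1" or "2") and "path". Single-end samples keep an empty fastq_2.
    """
    collapsed = {}
    for row in rows:
        key = row["sample"]
        entry = collapsed.setdefault(
            key, {"sample": key, "fastq_1": "", "fastq_2": ""}
        )
        # A paired_end value of "2" marks the second mate;
        # anything else is treated as read 1.
        column = "fastq_2" if str(row.get("paired_end")) == "2" else "fastq_1"
        entry[column] = row["path"]
    # Dicts preserve insertion order, so samples come out in first-seen order.
    return list(collapsed.values())
```

This simplified version keeps only one file per mate, so a sample sequenced across several lanes would need an extra grouping key.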

🧬 Assay support

ENCODEfetch currently provides assay-aware normalization and nf-core and Snakemake samplesheet exporters for:

  • ChIP-seq (production)
  • ATAC-seq (in beta)
  • RNA-seq (in beta)
  • More assays to be added.

Each assay can plug in its own normalization (e.g., FASTQ collapsing, strandedness detection) and samplesheet exporters.

🤝 Contributing

Contributions are welcome!

  • Add new assay classes under encodefetch/assays/.
  • Add new exporters under encodefetch/exporters/.
  • Extend metadata fields in build_file_record.

See CONTRIBUTING.md for details.
