peek-bio

Instant file previews for genomics data. One command, any format.

# via bioconda (recommended)
conda install -c bioconda peek-bio

# or via pip
pip install peek-bio

peek demo

What it does

Point peek at a file and get a structured summary: row counts, column types, quality scores, variant stats, mapping rates, QC warnings. No scripts, no notebooks, no googling command flags.

$ peek deseq2_results.csv

 deseq2_results.csv — >10,553 x 7 (CSV, comma-separated)
 ────────────────────────────────────────────────────────────────────
 Columns:
                   str    0610005C13Rik, 0610009B22Rik, ...  (1,000 unique)
   baseMean        float  3.92 … 1983.92  (median: 25.32, mean: 59.32)  ⡇⡀⡀⡀⡀⡀⡀⡀⡀⡀
   log2FoldChange  float  -3.29 … 3.60  (median: -0.02, mean: -0.04)    ⡀⡀⡀⡀⡀⡇⡄⡀⡀⡀⡀⡀
   lfcSE           float  0.11 … 1.23  (median: 0.35, mean: 0.40)       ⡄⡇⡆⡄⡄⡄⡀⡀⡀⡀⡀⡀
   stat            float  -5.94 … 8.10  (median: -0.06, mean: -0.11)    ⡀⡀⡀⡀⡇⡆⡄⡀⡀⡀⡀
   pvalue          float  5.46e-16 … 1.00  (median: 0.37, mean: 0.45)   ⡇⡄⡄⡄⡄⡄⡄⡀⡄⡄⡄⡄
   padj            float  3.42e-13 … 1.00  (median: 0.95, mean: 0.81)   ⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡇

 Missing:  pvalue (1)

$ peek NA12878.bam

 NA12878.bam — 61,614 reads (BAM, indexed)
 ────────────────────────────────────────────────────────────────────
 Reference:  3366 sequences, 3.2 Gb  [GRCh38 (with alts)]
 Reads:  60,749 mapped (98.6%), 865 unmapped
 Flags:  0.1% duplicates, 1.5% supplementary
 Paired:  yes (2×250 bp)
 Insert size:  mean 449  median 428  range 100–999  ⡀⡀⡀⡀⡆⡇⡄⡄⡄⡀⡀⡀
 Read groups:  3  (NA12878, NA12878, NA12878)
 Sort order:  coordinate
 Programs:  bwamem, MarkDuplicates, GATK ApplyBQSR
 MAPQ:  mean 55.3  median 60  ⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡇

$ peek ERR188273_chrX_1.fq.gz

 ERR188273_chrX_1.fq.gz — 30,531 reads, 2.3 Mb (FASTQ, Phred+33)
 ────────────────────────────────────────────────────────────────────
 Read length:  all 75 bp
 Quality:  mean Q36.7  median Q38  range Q2–Q41  ⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡆⡇
 GC content:  48.9%

$ peek clinvar.vcf.gz

 clinvar.vcf.gz — 4,403,650 variants (VCF)
 ────────────────────────────────────────────────────────────────────
 Variants:  4,103,565 snps, 93,659 insertions, 194,377 deletions, 12,049 complexes
 Ts/Tv:  1.69
 FILTER:  4,403,650 PASS
 Chroms:  32 total — top: 1 (398,195), 2 (384,641), 17 (265,676)

$ peek filtered_feature_bc_matrix/matrix.mtx.gz

 matrix.mtx.gz (12.3 MB) — 8,421 cells x 33,538 genes (Matrix Market, coordinate, integer)
 ────────────────────────────────────────────────────────────────────
 Non-zero:  17,438,362 entries (93.8% sparse)
 Cells:  8,421
 Features:  33,538
 Mean nnz/cell:  2,071
 Feature types:  33,538 Gene Expression
 Companions:  barcodes.tsv.gz, features.tsv.gz
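
The sparsity and per-cell figures in the Matrix Market summary follow directly from the counts on its header line. A quick check in Python, using only the numbers printed above (this is arithmetic on the reported values, not peek's own code):

```python
# Values copied from the matrix.mtx.gz summary above.
cells, genes, nonzero = 8_421, 33_538, 17_438_362

total_entries = cells * genes
sparsity = 1 - nonzero / total_entries        # fraction of entries that are zero
mean_nnz_per_cell = nonzero / cells           # average non-zero entries per cell

print(f"{sparsity:.1%} sparse")               # 93.8% sparse
print(f"{mean_nnz_per_cell:,.0f} nnz/cell")   # 2,071 nnz/cell
```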

Supported formats

Core (no extra dependencies):

Format    Extensions
CSV/TSV   .csv, .tsv, .txt
BED       .bed, .narrowPeak, .broadPeak, .bedGraph
FASTA     .fa, .fasta
FASTQ     .fq, .fastq
VCF       .vcf, .vcf.gz
MTX       .mtx
GTF/GFF   .gtf, .gff, .gff3

Optional (install what you need):

Format        Extensions          Install
SAM/BAM/CRAM  .sam, .bam, .cram   pip install peek-bio[bam]
Excel         .xlsx, .xls         pip install peek-bio[excel]
BigWig        .bw, .bigwig        pip install peek-bio[bigwig]
H5AD          .h5ad               pip install peek-bio[h5ad]

Or install everything: pip install peek-bio[all]

Files with non-standard extensions (or no extension at all) are detected automatically from their content.
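
To give a feel for how content-based detection can work in general, here is a minimal sketch that guesses a format from the first bytes of a file. It relies only on well-known signatures (the gzip magic number, the BAM and HDF5 magic strings, and the conventional first characters of text formats). The helper name `sniff_format` is hypothetical, and peek's actual detection logic may differ.

```python
import gzip
import zlib

def sniff_format(head: bytes) -> str:
    """Guess a file format from its leading bytes (illustrative helper only)."""
    if head.startswith(b"\x1f\x8b"):  # gzip magic number (BGZF/BAM uses it too)
        # Decompress just the prefix we were given; wbits=31 means "gzip container".
        head = zlib.decompressobj(31).decompress(head)
    if head.startswith(b"BAM\x01"):
        return "BAM"
    if head.startswith(b"\x89HDF\r\n\x1a\n"):
        return "HDF5"  # H5AD files are HDF5 containers
    if head.startswith(b"##fileformat=VCF"):
        return "VCF"
    if head.startswith(b"%%MatrixMarket"):
        return "MTX"
    if head.startswith(b">"):
        return "FASTA"
    if head.startswith(b"@"):
        # SAM headers and FASTQ records both start with "@";
        # a FASTQ record has a "+" separator on its third line.
        lines = head.split(b"\n")
        if len(lines) >= 3 and lines[2].startswith(b"+"):
            return "FASTQ"
        return "SAM"
    return "unknown"

print(sniff_format(b">chr1\nACGT\n"))                          # FASTA
print(sniff_format(gzip.compress(b"##fileformat=VCFv4.2\n")))  # VCF
```

Checking signatures before falling back to line-based heuristics keeps binary formats from being misread as text.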

Directory scan

Point peek at a folder to get an instant inventory of all genomics files:

$ peek data/

 data/ — 30 genomics files, 3.1 GB
 ────────────────────────────────────────────────────────────────────
   1 FASTA  all_ref_sva.fa
   11 BAM/SAM/CRAM  (3 indexed)
   1 VCF  candidate_EOPC_variants.vcf.gz
   4 BED  CEBPG.bed, ENCFF363RKC.bed, ...
   3 GTF/GFF  fimo_HP.gff, fimo_cobound.gff, gencode.v38.basic.annotation.gtf.gz
   1 BigWig  k562_MNase.bw
   2 H5AD  neurips_bmmc.h5ad, pbmc68k.h5ad
   2 Excel  Oct4_RS-matrix_Rep1-Apr-2021.xlsx, nature_genetics_supp.xlsx
   5 CSV/TSV

The scan detects FASTQ pairs (R1/R2) and indexed BAMs, and skips hidden files.
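
The kind of bookkeeping a directory scan involves can be sketched in a few lines. The example below works on a plain list of file names, uses an abridged extension map taken from the "Supported formats" tables, and assumes a BAM counts as indexed when a sibling `.bai` file exists. All function names are hypothetical; this is not peek's implementation.

```python
from collections import Counter

# Abridged extension-to-format map, based on the "Supported formats" tables.
EXTENSIONS = {
    ".csv": "CSV/TSV", ".tsv": "CSV/TSV",
    ".bed": "BED", ".fa": "FASTA", ".fasta": "FASTA",
    ".fq": "FASTQ", ".fastq": "FASTQ", ".vcf": "VCF",
    ".bam": "BAM/SAM/CRAM", ".sam": "BAM/SAM/CRAM", ".cram": "BAM/SAM/CRAM",
    ".gtf": "GTF/GFF", ".gff": "GTF/GFF", ".gff3": "GTF/GFF",
}

def classify(name):
    """Return the format label for a file name, or None if unrecognized."""
    lower = name.lower()
    if lower.endswith(".gz"):          # look through gzip compression
        lower = lower[:-3]
    dot = lower.rfind(".")
    return EXTENSIONS.get(lower[dot:]) if dot != -1 else None

def inventory(names):
    """Count files per format, skipping hidden files (leading dot)."""
    counts = Counter()
    for name in names:
        if name.startswith("."):
            continue
        label = classify(name)
        if label:
            counts[label] += 1
    return counts

def indexed_bams(names):
    """BAMs with a sibling index named either x.bam.bai or x.bai."""
    present = set(names)
    return [n for n in names if n.endswith(".bam")
            and (n + ".bai" in present or n[:-4] + ".bai" in present)]

names = ["a.bam", "a.bam.bai", "b.bam", ".hidden.vcf", "x.vcf.gz",
         "s_R1.fq.gz", "s_R2.fq.gz", "notes.md"]
print(dict(inventory(names)))  # {'BAM/SAM/CRAM': 2, 'VCF': 1, 'FASTQ': 2}
print(indexed_bams(names))     # ['a.bam']
```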

Paired FASTQ comparison

Give peek two FASTQ files and it automatically compares them side by side:

$ peek sample_R1.fq.gz sample_R2.fq.gz

 Paired FASTQ Comparison
 ────────────────────────────────────────────────────────────────────
                 sample_R1.fq.gz         sample_R2.fq.gz
          Reads  1,204,881               1,204,881               ✓
       Total bp  90.4 Mb                 90.4 Mb
    Read length  75 bp                   75 bp                   ✓
   Mean quality  Q36.7                   Q34.2                   ✓
     GC content  48.9%                   49.1%                   ✓
       Encoding  Phred+33                Phred+33                ✓

Mismatched read counts (broken pairing) are flagged with a QC warning.
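
The read-count check itself is simple to reproduce by hand. A minimal sketch, assuming standard four-line FASTQ records; the function names are hypothetical and the warning text is illustrative rather than peek's actual message:

```python
import gzip

def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")

def count_reads(lines):
    """Count FASTQ records in an iterable of lines (4 lines per record)."""
    return sum(1 for _ in lines) // 4

def check_pairing(r1_lines, r2_lines):
    """Return a warning string if the two mates differ in read count, else None."""
    n1, n2 = count_reads(r1_lines), count_reads(r2_lines)
    if n1 != n2:
        return f"WARNING: read counts differ ({n1:,} vs {n2:,})"
    return None

# With real files (paths are placeholders):
#   with open_text("sample_R1.fq.gz") as r1, open_text("sample_R2.fq.gz") as r2:
#       print(check_pairing(r1, r2))
r1 = ["@a", "ACGT", "+", "IIII", "@b", "ACGT", "+", "IIII"]
r2 = r1[:4]                    # second mate is missing one record
print(check_pairing(r1, r2))   # WARNING: read counts differ (2 vs 1)
```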

Explain mode

New to bioinformatics? Add --explain for plain-English annotations of every metric:

$ peek --explain variants.vcf

 Ts/Tv:  1.85
   ↳ Transition/transversion ratio. Transitions (A<>G, C<>T) are chemically
     favored over transversions (all other changes). Whole-genome ~2.0, exome ~2.8.
     A low ratio can indicate sequencing artifacts or contamination.

 MAPQ:  mean 55.3  median 60
   ↳ Mapping quality: confidence that each read is aligned to the correct position.
     60 = near-certain. 0 = equally likely at multiple locations. Below 20 = ambiguous.
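
For readers who want to see where a Ts/Tv number comes from, it can be computed from a list of single-nucleotide variants. A minimal sketch, assuming each variant is given as a (ref, alt) pair of single bases; the function names are hypothetical and this is not peek's implementation:

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)

def ts_tv(snvs):
    """Ratio of transitions to transversions over (ref, alt) single-base pairs."""
    ts = tv = 0
    for ref, alt in snvs:
        if ref.upper() == alt.upper():
            continue                 # not a substitution, ignore
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("inf")

snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ts_tv(snvs))  # 3 transitions / 2 transversions = 1.5
```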

QC warnings

peek flags common issues automatically:

  • Unusual GC content (outside 25-65%)
  • High N content in assemblies (>20%)
  • Low mean base quality in FASTQ (<Q20)
  • Adapter contamination in FASTQ (>5%)
  • Low mapping rate in BAM/SAM (<80%)
  • Low MAPQ scores (<20 mean)
  • High duplicate rate (>30%)
  • Ts/Tv ratio out of range in VCF
  • Low genotype rate in multi-sample VCF (<90%)
  • No gene features or missing gene_id in GTF
  • Single-chromosome GTF (possible subset)
  • Columns with >50% missing data
  • Mixed-type columns (numbers and strings in the same column)
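
Two of the FASTQ checks above can be illustrated with a short sketch. It uses the thresholds from the list (GC content outside 25-65%, mean base quality below Q20) and the standard Phred+33 decoding, Q = ord(char) - 33. Averaging Q over all bases is an assumption made here, and the function names and warning strings are hypothetical, not peek's own.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 by default)."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0

def fastq_warnings(reads):
    """reads: list of (sequence, quality_string) tuples."""
    bases = "".join(seq for seq, _ in reads)
    quals = "".join(qual for _, qual in reads)
    warnings = []
    gc = gc_fraction(bases)
    if not 0.25 <= gc <= 0.65:          # threshold from the list above
        warnings.append(f"Unusual GC content ({gc:.1%})")
    q = mean_phred(quals)
    if q < 20:                          # threshold from the list above
        warnings.append(f"Low mean base quality (Q{q:.1f})")
    return warnings

print(fastq_warnings([("ACGT", "IIII")]))          # []
print(fastq_warnings([("GGGGCCCC", "!!!!!!!!")]))
# ['Unusual GC content (100.0%)', 'Low mean base quality (Q0.0)']
```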

Usage

peek FILE [FILE ...]          # preview one or more files
peek DIRECTORY                # scan a folder for genomics files
peek -r DIRECTORY             # scan recursively (include subdirectories)
peek R1.fq R2.fq              # compare paired FASTQ files
peek https://example.com/f.vcf  # preview a file from a URL
peek --explain FILE           # add plain-English explanations
peek --head 20 FILE           # show 20 preview rows instead of 5
peek --no-color FILE          # plain text output (no ANSI colors)
peek --formats                # list all supported formats + install status
peek --version                # print version

Compressed files (.gz) are handled transparently. A file given as a URL is downloaded to a temporary file, which is deleted automatically afterward.

License

MIT
