Inference of ploidy and heterozygosity structure using whole genome sequencing data

Smudgeplot

Version: 0.5.3b Skylight

Authors: Sam Ebdon, Gene W. Myers, Kamil S. Jaron, and Tianyi Ma.

Installation

This version of smudgeplot operates on FastK k-mer databases. The smudgeplot installation consists of a Python package and a C backend that searches for all the k-mer pairs (smudgeplot hetmers) and extracts sequences of k-mer pairs (smudgeplot extract).

We recommend installing smudgeplot within a conda environment.

#optional conda environment setup
conda create -n smudgeplot && conda activate smudgeplot
conda install pip

# install via pypi
pip install smudgeplot

# or download and install directly. See below if you need to compile the C dependencies.
git clone https://github.com/KamilSJaron/smudgeplot.git
cd smudgeplot && pip install .
smudgeplot -h # check installation succeeded

That should do everything necessary to make smudgeplot fully operational. You can run smudgeplot --help to see if it worked. If you activated a virtual environment prior to installation (either conda or any other), then smudgeplot was installed within that environment.

Note the smudgeplot version downloadable from conda itself is not currently up to date.

Compiling the C code

The process above should install everything including compilation of the C backend. If you need or would like to know how to compile the code yourself you can simply run

make

This will not, however, install the smudgeplot python package.

Pypi installation [EXPERIMENTAL]

We are working on packaging smudgeplot for pypi. You are welcome to try installing from pypi if you are interested and please open an issue if you have problems. If it fails please follow the main instructions above to install for now.

pip install smudgeplot

Example run on Saccharomyces data

Requires ~2.1 GB of disk space, with FastK and smudgeplot installed.

# download data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_2.fastq.gz

# move them to a reasonable place (-p also creates the parent directory data/)
mkdir -p data/Scer
mv *fastq.gz data/Scer/

# run FastK to create a k-mer database
FastK -v -t4 -k31 -M16 -T4 data/Scer/SRR3265401_[12].fastq.gz -Ndata/Scer/FastK_Table

# find all k-mer pairs in the dataset using the hetmers module
smudgeplot hetmers -L 12 -t 4 -o data/Scer/kmerpairs --verbose data/Scer/FastK_Table
# this generates the `data/Scer/kmerpairs_text.smu` file;
# it is a flat file with three columns: covB, covA and freq (the number of k-mer pairs with these respective coverages)

# use the .smu file to infer ploidy and create smudgeplot
smudgeplot all -o data/Scer/trial_run data/Scer/kmerpairs.smu

# check that the output files were generated (3 pdfs, some summary tables and logs)
ls data/Scer/trial_run_*

The y-axis limit is 100 by default; you can specify the ylim argument to scale it differently:

smudgeplot all -o data/Scer/trial_run_ylim70 data/Scer/kmerpairs.smu -ylim 70

There is also a plotting module that requires the coverage, a list of smudges, and the smudge sizes listed in a tabular file. This plotting module does not perform the inference and should be used only if you know the right answers already.

How smudgeplot works

This tool extracts heterozygous kmer pairs from kmer count databases and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc.
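The two quantities compared above can be computed directly from k-mer pair coverages. The following is a minimal sketch, not smudgeplot's own code: it assumes rows of whitespace-separated integers in the order covB, covA, freq (as described for the .smu file) with no header line, and the function names `parse_smu` and `smudge_coordinates` are hypothetical.

```python
def parse_smu(lines):
    """Parse lines of 'covB covA freq' (assumed whitespace-separated integers)."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        cov_b, cov_a, freq = (int(f) for f in fields)
        yield cov_b, cov_a, freq


def smudge_coordinates(rows):
    """Yield (CovA + CovB, CovB / (CovA + CovB), freq) for each k-mer pair bin."""
    for cov_b, cov_a, freq in rows:
        total = cov_a + cov_b
        yield total, cov_b / total, freq


# Made-up rows for illustration only, not real data.
example = ["20 20 900", "20 40 150", "20 60 60"]
for total, relative, freq in smudge_coordinates(parse_smu(example)):
    print(f"{total}\t{relative:.2f}\t{freq}")
```

Each printed row gives the position of one bin on the two axes (total coverage and relative coverage of the less covered k-mer) together with the number of k-mer pairs that fall there.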

Smudgeplots are computed from raw or, even better, trimmed reads and show the haplotype structure using heterozygous kmer pairs. For example (from an older version):

[Figure: example smudgeplot]

Every haplotype structure has a unique smudge on the graph and the heat of the smudge indicates how frequently the haplotype structure is represented in the genome compared to the other structures. The image above is an ideal case, where the sequencing coverage is sufficient to beautifully separate all the smudges, providing very strong and clear evidence of triploidy.

This tool is planned to be a part of GenomeScope in the near future.

More usage information

The input is a set of whole genome sequencing reads; the higher the coverage, the better. The method is designed to process big datasets, so don't hesitate to pool all single-end and paired-end libraries together.

The workflow is automatic, but it's not fool-proof. It requires some decisions. Use this tool jointly with GenomeScope.

Smudgeplot generates two plots, one with coloration on a log scale and the other on a linear scale. The legend indicates approximate kmer pairs per tile densities. Note that a single polymorphism generates multiple heterozygous kmers. As such, the reported numbers do not directly correspond to the number of variants. Instead, the actual number is approximately 1/k times the reported numbers, where k is the kmer size (in summary already recalculated). It's important to note that this process does not exhaustively attempt to find all of the heterozygous kmers from the genome. Instead, only a sufficient sample is obtained in order to identify relative genome structure. You can also report the minimal number of loci that are heterozygous if the inference is correct.
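The 1/k relationship described above is simple arithmetic. As a rough sketch (the function name `approx_variant_count` and the input numbers are hypothetical, and the summary output already applies this correction):

```python
def approx_variant_count(kmer_pairs: int, k: int) -> float:
    """Approximate number of variants behind a reported count of kmer pairs.

    Uses the rule of thumb from the text: variants ~ reported pairs / k.
    """
    if k <= 0:
        raise ValueError("k must be a positive kmer size")
    return kmer_pairs / k


# Illustrative numbers only: 310,000 reported kmer pairs at k = 31.
print(approx_variant_count(310_000, 31))
```

This is an order-of-magnitude estimate, since not every heterozygous kmer in the genome is sampled.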

GenomeScope

You can feed the kmer coverage histogram to GenomeScope (either run GenomeScope locally or use the web server).

This tool estimates the size, heterozygosity, and repetitive fraction of the genome. By inspecting the fitted model you can determine the location of the smallest peak after the error tail. Then, you can decide the low end cutoff below which all kmers will be discarded as errors (approximately 0.5 times the haploid kmer coverage), and the high end cutoff above which all kmers will be discarded (approximately 8.5 times the haploid kmer coverage).
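The rule of thumb for the two cutoffs can be written out as below. This is only a sketch of the arithmetic: the function name `coverage_cutoffs` is hypothetical, the haploid coverage value is made up, and the factors 0.5 and 8.5 are the approximate values quoted above and should be adjusted after inspecting the fitted model.

```python
def coverage_cutoffs(haploid_cov: float,
                     low_factor: float = 0.5,
                     high_factor: float = 8.5) -> tuple[float, float]:
    """Return (low, high) kmer coverage cutoffs from the haploid kmer coverage."""
    if haploid_cov <= 0:
        raise ValueError("haploid coverage must be positive")
    return low_factor * haploid_cov, high_factor * haploid_cov


# Illustrative value only: a haploid kmer coverage of 20x.
low, high = coverage_cutoffs(20)
print(f"discard kmers with coverage below {low} or above {high}")
```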

Frequently Asked Questions

Frequently asked questions are collected on our wiki. Smudgeplot does not demand many computational resources, but make sure you check the memory requirements before you extract kmer pairs (the hetmers task). If you don't find an answer to your question in the FAQ, open an issue or drop us an email.

Check projects to see how the development goes.

Contributions

This is definitely an open project, and contributions are welcome. You can check some of the ideas for the future in projects and in the dev branch. The file playground/DEVELOPMENT.md contains some development notes. The directory playground contains some snippets, attempts, and other items of interest.

Reference

Ranallo-Benavidez, T.R., Jaron, K.S. & Schatz, M.C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications 11, 1432 (2020). https://doi.org/10.1038/s41467-020-14998-3

Acknowledgements

This blogpost by Myles Harrison largely inspired the visual output of smudgeplots. The colourblind-friendly colour theme was suggested by @ggrimes. We are grateful for the helpful comments of beta testers and pre-release chatters!

Changelog

0.5

  • experimental feature to extract sequences of the kmers in the pair; this functionality will hopefully be restored at some point, together with functionality to assess the quality of an assembly.
  • histograms are back

0.4

  • the search for the kmer pair will be within both canonical and non-canonical k-mer sets (Gene demonstrated it makes a difference)
  • the tool will support the FastK kmer counter only
  • the backend by Gene is parallelized and massively faster
  • the intermediate file will be a flat file with the 2d histogram with cov1, cov2, freq columns (as opposed to a list of coverages of pairs cov1 cov2)
  • completely revamped plot showing all individual kmer pairs instead of aggregating them into squares
  • new smudge detection algorithm based on grid projection on the smudge plane (working, but under revision at the moment)
  • R package smudgeplot was retired and is no longer used
