
minimal shared K-mers across the whole transcriptome

Project description

Step 1: minimal-shared region filtering

kmer_frequency_distribution_mini_shared.py

This tool processes a FASTA file containing transcript sequences and outputs a set of CSV files that summarize the k-mer content for each transcript. Each CSV file contains a list of k-mers of specified length that are present in the transcript, along with their local and global frequency, and information on which transcripts each k-mer is present in.

Features

K-mer Counting: For a given transcriptome FASTA file, count all k-mers of a specified length (the default is 50).
Minimal Shared K-mers Output: For each transcript, output the k-mers that have the minimum global frequency, that is, the smallest number of transcripts in which any k-mer of that transcript appears.
CSV Output Content: Generate a CSV file for each isoform with the following columns:

'kmer': The k-mer sequence.
'Local_Frequency': The number of times the k-mer appears in the specific isoform.
'Global_Frequency': The number of transcripts that contain the k-mer across the entire transcriptome.
'Present_in_Transcripts': A list of transcript identifiers that share the k-mer if its global frequency is more than 1. For unique k-mers, the identifier of the single transcript is given.
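
The selection described above can be sketched in a few lines of Python. This is an illustrative sketch, not the package's actual code: the function names `count_kmers` and `minimal_shared_kmers` are hypothetical, and it assumes that k-mers are overlapping windows and that global frequency counts each transcript at most once.

```python
from collections import Counter, defaultdict


def count_kmers(seq, k):
    """Count every overlapping k-mer in one sequence (local frequency)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def minimal_shared_kmers(transcripts, k=50):
    """Map transcript id -> (minimum global frequency, {kmer: local frequency}).

    `transcripts` maps a transcript id to its sequence. Global frequency is
    taken to be the number of transcripts that contain the k-mer.
    """
    local = {tid: count_kmers(seq, k) for tid, seq in transcripts.items()}

    # Record which transcripts contain each k-mer.
    containing = defaultdict(set)
    for tid, counts in local.items():
        for kmer in counts:
            containing[kmer].add(tid)

    result = {}
    for tid, counts in local.items():
        if not counts:  # sequence shorter than k: no k-mers to report
            result[tid] = (None, {})
            continue
        min_freq = min(len(containing[kmer]) for kmer in counts)
        kept = {
            kmer: n
            for kmer, n in counts.items()
            if len(containing[kmer]) == min_freq
        }
        result[tid] = (min_freq, kept)
    return result
```

With the toy input `{"t1": "AAAC", "t2": "AAAG", "t3": "AAA"}` and `k=3`, the shared k-mer `AAA` is dropped for `t1` and `t2` because each has a unique k-mer, while `t3` keeps `AAA` because it has nothing rarer.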

Installation

To install the specific version (0.1.0) of the package minimal-shared-kmers using pip, run the following command in your terminal:

pip install minimal-shared-kmers==0.1.0

Usage

To use this tool, you need to have Python installed on your system. The script requires a FASTA file with the transcript sequences as input and a directory path where the CSV files will be saved as output.

Execute the script with the necessary arguments from the command line. For example:

python kmer_frequency_distribution_mini_shared.py --input path/to/your/ACTB_reference/mart_export_ACTB.txt --output path/to/output/directory/

Command-Line Arguments
--input: Path to the input FASTA file containing transcript sequences, for example a sequence export from Ensembl BioMart (https://useast.ensembl.org/biomart/martview/aeb3390f02325ab7951be9a7d6daaa42).
--output: Path to the output directory where CSV files for each transcript will be saved.
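
For readers who want to see how two such arguments and a FASTA input might be handled, here is a minimal sketch. It is not taken from the script itself: `build_parser` and `read_fasta` are hypothetical helper names, and treating both flags as required is an assumption.

```python
import argparse


def build_parser():
    """Build a parser with the two documented flags (both assumed required)."""
    parser = argparse.ArgumentParser(
        description="Write minimal shared k-mers per transcript to CSV files."
    )
    parser.add_argument("--input", required=True,
                        help="FASTA file with transcript sequences")
    parser.add_argument("--output", required=True,
                        help="directory for the per-transcript CSV files")
    return parser


def read_fasta(lines):
    """Parse FASTA lines into {header without '>': sequence}."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

A caller would typically run `args = build_parser().parse_args()` and then pass an open file handle for `args.input` to `read_fasta`.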

Output File Details

For each transcript in the input FASTA file, the script will create a corresponding CSV file in the output directory with a name derived from the transcript header, sanitized to be filesystem-friendly.
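
The exact sanitization rule is not documented here. One plausible approach, shown purely as an assumption, is to replace every run of characters outside a conservative safe set with an underscore. The function name `sanitize_header` and the fallback name `transcript` are hypothetical.

```python
import re


def sanitize_header(header):
    """Turn a FASTA header into a filesystem-friendly file stem.

    Assumed rule: keep letters, digits, '.', '_' and '-'; collapse any
    other run of characters into a single underscore.
    """
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", header.strip())
    # Fall back to a generic name if nothing usable remains.
    return name.strip("_") or "transcript"
```

Under this rule a BioMart-style header such as `ENSG00000075624|ENST00000646664 ACTB` would become `ENSG00000075624_ENST00000646664_ACTB`, to which the script would append `.csv`.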

In the output CSV files for each transcript, only k-mers that have the smallest global frequency for that transcript are included. If multiple k-mers share the same smallest global frequency, then all such k-mers are included in the CSV file. The 'Present_in_Transcripts' field in the CSV may include multiple transcript names, indicating that those transcripts share the k-mer.

If the global frequency of a k-mer is 1, indicating that it is unique to a single transcript, then the 'Present_in_Transcripts' field will only contain the identifier of that specific transcript.
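
To make the row layout concrete, the sketch below writes rows with the four documented columns. It is an illustration only: `write_rows` is a hypothetical helper, and joining transcript identifiers with a semicolon is an assumption, since the real separator is not specified above.

```python
import csv
import io

COLUMNS = ["kmer", "Local_Frequency", "Global_Frequency",
           "Present_in_Transcripts"]


def write_rows(handle, rows):
    """Write (kmer, local_frequency, transcript_ids) tuples as CSV rows.

    Global frequency is derived as the number of transcript ids.
    """
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    for kmer, local_freq, transcript_ids in rows:
        writer.writerow([
            kmer,
            local_freq,
            len(transcript_ids),
            ";".join(sorted(transcript_ids)),  # separator is an assumption
        ])


# Usage: one unique k-mer and one k-mer shared by two transcripts.
buffer = io.StringIO()
write_rows(buffer, [("ACGT", 1, {"t1"}), ("GGCC", 2, {"t2", "t1"})])
csv_text = buffer.getvalue()
```

A unique k-mer yields a row whose last field holds a single identifier, matching the behaviour described for a global frequency of 1.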

