
Tools for analyzing biological sequences


BioSequences


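About this project

BioSequences is a package that bundles basic, commonly used tools for biological sequence analysis. It aims to speed up routine sequence-analysis workflows and to provide basic support for large-scale data analysis.

For the full documentation, see Document.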

Installation

Install with pip

pip install biosequences

Install from source

On Windows, the Microsoft Visual C++ build tools are required. On Linux, gcc or another C compiler is required.

git clone https://github.com/Dragon-GCS/BioSequences.git
cd BioSequences
python -m pip install .

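Examples

Loading sequences

bioseq can read sequences from standard FASTA files or from the NCBI and Ensembl databases. When a fetch function is given a list, it fetches all of the listed sequences in one call.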

>>> from bioseq.utils import loadFasta, fetchNCBI, fetchENS
>>> sequence1 = loadFasta("/path/to/file.fasta")
>>> bsa = fetchNCBI("NP_851335.1")
>>> actin = fetchENS("ENST00000614376")
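
To make the "standard FASTA format" mentioned above concrete, here is a minimal pure-Python sketch of reading FASTA records. It is not how bioseq's loadFasta is implemented; the function name parse_fasta and the sample records are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated and upper-cased.
            records[header] += line.upper()
    return records


# Hypothetical two-record input, used only to exercise the parser.
example = """>seq1 demo record
ATCG
gcat
>seq2
AUGC
"""
records = parse_fasta(example.splitlines())
print(records)
```

A sequence that spans several lines is joined into one string, which is why seq1 ends up as ATCGGCAT.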

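Basic sequence operations

bioseq.RNA, bioseq.DNA, and bioseq.Peptide all inherit from bioseq.Sequence, so the three share largely the same basic operations.

  • Inspect the basic properties of a sequence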

    >>> actin.GC, actin.length
    (0.5, 102)
    >>> actin.composition
    {'A': 24, 'C': 18, 'G': 33, 'T': 27}
    >>> actin.seq
    'AGAAACTTTAGCATCTGGCTAGGAGCATCTGTGGTGGCTCACCTTTCTACCTATACGTGTGAGTGGGTGACCTGAGAGGAGTACGGTGAGCATATGAGGATG'
    >>> round(bsa.weight, 1)
    69334.4
    >>> bsa.pI
    6.805
    >>> round(bsa.chargeInpH(7.4), 2)
    -13.76
    
  • DNA and RNA sequences can be translated into peptides with translate(), and DNA sequences also have a transcript() method that transcribes them into RNA. The start codons can be customized through bioseq.config.START_CODON, and the codon table through bioseq.config.CODON_TABLE.

    >>> from bioseq.config import START_CODON, CODON_TABLE
    >>> actin.translate()
    >>> START_CODON[0] = 'AGA'
    >>> actin.translate()
    [N-RNFSIWLGASVVAHLSTYTCEWVT-C]
    >>> CODON_TABLE["AAC"] = "Y"
    >>> actin.translate()
    [N-RYFSIWLGASVVAHLSTYTCEWVT-C]
    
  • Two sequences of the same type can be concatenated

    >>> from bioseq import DNA
    >>> dna1 = DNA("ATCG")
    >>> dna2 = DNA("GCAT")
    >>> dna1 + dna2
    "5'-ATCGGCAT-3'"
    >>> dna2 + dna1
    "5'-GCATATCG-3'"
    
  • Modify a sequence with the mutation() method

    >>> dna1.mutation("ATC", "GGG")
    'GGGG'
    >>> dna1.mutation(0, "AT")
    'ATGG'
    >>> dna1.mutation([0, 3], "C")
    'CTGC'
    
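  • Sequence implements two basic alignment algorithms in C, Needleman-Wunsch global alignment and Smith-Waterman local alignment, for fast sequence comparison. Local alignment returns only the matched local segments; in the example below, the second call (with 2 as the second argument) performs a local alignment.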

    >>> DNA("GCATGCT").align("GATTACA")
    ('GCA-TGCT', 'G-ATTACA', -4.0)
    >>> DNA("GCATGCT").align("GATTACA", 2)
    ('AT', 'AT', 4.0)
    

    The first two items returned are the aligned sequences and the third is the alignment score. bioseq.utils.printAlign() prints the result in a more readable form.

    >>> from bioseq.utils import printAlign
    >>> seq1, seq2, score = DNA("GCATGCT").align("GATTACA")
    >>> printAlign(seq1, seq2)
    1 GCA-TGCT
      ┃━┃━┃•┃•
    1 G-ATTACA
    

    The alignment penalties can be changed through bioseq.config.AlignmentConfig. The defaults are MATCH = 2.0, MISMATCH = -3.0, GAP_OPEN = -3.0, and GAP_EXTEND = -3.0.

    >>> from bioseq.config import AlignmentConfig
    >>> AlignmentConfig.GAP_OPEN = -10
    >>> DNA("GCATGCT").align("GATTACA")
    ('GCATGCT', 'GATTACA', -6.0)
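
The global alignment scores shown above can be reproduced with a short pure-Python sketch of the Needleman-Wunsch recurrence. This is a simplification and not the library's C implementation: it assumes a single linear gap penalty instead of separate GAP_OPEN and GAP_EXTEND values, it returns only the score (no traceback), and the function name nw_score is hypothetical.

```python
def nw_score(a, b, match=2.0, mismatch=-3.0, gap=-3.0):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev holds the previous row of the DP matrix; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # column 0: a[:i] aligned against gaps only
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Best of: align ca with cb, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(nw_score("GCATGCT", "GATTACA"))             # -4.0
print(nw_score("GCATGCT", "GATTACA", gap=-10.0))  # -6.0
```

With the default penalties every gap position costs -3.0, so the linear model gives the same -4.0 as the first example. With gap=-10.0 no gapped alignment can beat the ungapped one (3 matches, 4 mismatches), which gives -6.0; in general a linear penalty of -10.0 is not equivalent to GAP_OPEN = -10 with GAP_EXTEND = -3.0.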
    
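Going back to the translation step earlier in this list (the calls that return peptides such as N-RNFSIW...-C), the following sketch shows how a custom start codon and a modified codon table change the result. It is only an illustration under stated assumptions: it uses the standard genetic code, scans a single reading frame from the first start codon to the first stop codon, and returns a plain string, whereas the library returns a list of peptide objects. The names CODONS, translate, and ACTIN are hypothetical and independent of bioseq.

```python
from itertools import product

# Standard genetic code, listed in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# The actin sequence printed by actin.seq in the example above.
ACTIN = (
    "AGAAACTTTAGCATCTGGCTAGGAGCATCTGTGGTGGCTCACCTTTCTACCTATACGTGTGAGTGGGTGA"
    "CCTGAGAGGAGTACGGTGAGCATATGAGGATG"
)


def translate(seq, start="ATG", table=CODONS):
    """Translate from the first start codon up to, not including, a stop codon."""
    begin = seq.find(start)
    if begin < 0:
        return ""  # no start codon found
    peptide = []
    for i in range(begin, len(seq) - 2, 3):
        aa = table[seq[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


print(translate(ACTIN, start="AGA"))
# Override one codon without mutating the shared table.
custom = {**CODONS, "AAC": "Y"}
print(translate(ACTIN, start="AGA", table=custom))
```

The two printed peptides are RNFSIWLGASVVAHLSTYTCEWVT and RYFSIWLGASVVAHLSTYTCEWVT, matching the sequences shown in the translation example. Passing the table as an argument avoids changing global state, which is a design choice of this sketch rather than of the library.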

Contributors

@Dragon-GCS @laxtiz

Acknowledgements
