Tools for analyzing biological sequences

BioSequences


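About this project

BioSequences is a package that collects basic, commonly used tools for biological sequence analysis. It aims to make routine sequence analysis workflows more efficient and to provide basic support for large-scale data analysis.

For the full documentation, see Document.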

Installation

Install with pip

pip install biosequences

Install from source

Building from source requires a C compiler: the Microsoft VC++ build tools on Windows, or gcc (or another compiler) on Linux.

git clone https://github.com/Dragon-GCS/BioSequences.git
cd BioSequences
python -m pip install .

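Examples

Loading sequences

bioseq can read sequences from a standard FASTA file or fetch them from the NCBI and Ensembl databases. When a fetch function is given a list, it fetches all of the listed sequences in one batch.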

>>> from bioseq.utils import loadFasta, fetchNCBI, fetchENS
>>> sequence1 = loadFasta("/path/to/file.fasta")
>>> bsa = fetchNCBI("NP_851335.1")
>>> actin = fetchENS("ENST00000614376")
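
To make the FASTA layout concrete, here is a minimal pure-Python sketch of reading records from FASTA text. It is not how loadFasta is implemented; the function name parse_fasta and the dictionary return type are assumptions made only for this illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a {record_id: sequence} dictionary."""
    records: Dict[str, str] = {}
    current_id = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token of the header.
            header = line[1:].strip()
            current_id = header.split()[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may span several lines; join and upper-case it.
            records[current_id] += line.upper()
    return records


# Hypothetical in-memory input; a real file could be passed as open(path).
example = [">seq1 demo record", "ATCG", "gcat", "", ">seq2", "AUGC"]
print(parse_fasta(example))
```

Any iterable of lines works as input, so an open file handle can be passed directly.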

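Basic sequence operations

bioseq.RNA, bioseq.DNA, and bioseq.Peptide all inherit from bioseq.Sequence, so the three classes share the same basic operations.

  • View the basic properties of a sequence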

    >>> actin.GC, actin.length
    (0.5, 102)
    >>> actin.composition
    {'A': 24, 'C': 18, 'G': 33, 'T': 27}
    >>> actin.seq
    'AGAAACTTTAGCATCTGGCTAGGAGCATCTGTGGTGGCTCACCTTTCTACCTATACGTGTGAGTGGGTGACCTGAGAGGAGTACGGTGAGCATATGAGGATG'
    >>> round(bsa.weight, 1)
    69334.4
    >>> bsa.pI
    6.805
    >>> round(bsa.chargeInpH(7.4), 2)
    -13.76
    
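  • DNA and RNA sequences provide transcript(), and DNA sequences also provide translate(). In the example below, once START_CODON[0] is set to 'AGA', the call returns the peptide read from that start codon to the first stop codon, written from N-terminus to C-terminus. Start codons can be customized through bioseq.config.START_CODON, and the codon table through bioseq.config.CODON_TABLE.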

    >>> from bioseq.config import START_CODON, CODON_TABLE
    >>> actin.transcript()
    >>> START_CODON[0] = 'AGA'
    >>> actin.transcript()
    [N-RNFSIWLGASVVAHLSTYTCEWVT-C]
    >>> CODON_TABLE["AAC"] = "Y"
    >>> actin.transcript()
    [N-RYFSIWLGASVVAHLSTYTCEWVT-C]
    
  • Two sequences of the same type can be concatenated

    >>> from bioseq import DNA
    >>> dna1 = DNA("ATCG")
    >>> dna2 = DNA("GCAT")
    >>> dna1 + dna2
    "5'-ATCGGCAT-3'"
    >>> dna2 + dna1
    "5'-GCATATCG-3'"
    
  • Modify a sequence with the mutation() method

    >>> dna1.mutation("ATC", "GGG")
    'GGGG'
    >>> dna1.mutation(0, "AT")
    'ATGG'
    >>> dna1.mutation([0, 3], "C")
    'CTGC'
    
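  • Sequence implements two basic alignment algorithms in C, Needleman-Wunsch global alignment and Smith-Waterman local alignment, so sequences can be aligned quickly (local alignment returns only the matched local segments).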

    >>> DNA("GCATGCT").align("GATTACA")
    ('GCA-TGCT', 'G-ATTACA', -4.0)
    >>> DNA("GCATGCT").align("GATTACA", 2)
    ('AT', 'AT', 4.0)
    

    The first two items returned are the aligned sequences and the third is the alignment score. bioseq.utils.printAlign() prints the alignment in a more readable form.

    >>> from bioseq.utils import printAlign
    >>> seq1, seq2, score = DNA("GCATGCT").align("GATTACA")
    >>> printAlign(seq1, seq2)
    1 GCA-TGCT
      ┃━┃━┃•┃•
    1 G-ATTACA
    

    The scoring used for alignment can be changed through bioseq.config.AlignmentConfig. The defaults are MATCH (2.0), MISMATCH (-3.0), GAP_OPEN (-3.0), and GAP_EXTEND (-3.0).

    >>> from bioseq.config import AlignmentConfig
    >>> AlignmentConfig.GAP_OPEN = -10
    >>> DNA("GCATGCT").align("GATTACA")
    ('GCATGCT', 'GATTACA', -6.0)
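
For readers who want to see how a global alignment score such as the ones above arises, the following is a minimal pure-Python sketch of Needleman-Wunsch. It is not the package's C implementation: the function name needleman_wunsch is hypothetical, and the sketch assumes a single linear gap penalty instead of separate GAP_OPEN and GAP_EXTEND values. When several alignments share the best score, the traceback here may pick a different one than the library does.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: float = 2.0,
                     mismatch: float = -3.0,
                     gap: float = -3.0) -> Tuple[str, str, float]:
    """Global alignment with a linear gap penalty (illustrative sketch)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


print(needleman_wunsch("GCATGCT", "GATTACA"))
print(needleman_wunsch("GCATGCT", "GATTACA", gap=-10.0))
```

With the default scores, four matches, two mismatches, and two single-position gaps give 4 × 2.0 + 2 × (-3.0) + 2 × (-3.0) = -4.0. Raising the gap cost to -10 makes the ungapped alignment the best one, with three matches and four mismatches: 3 × 2.0 + 4 × (-3.0) = -6.0.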
    

Contributors

@Dragon-GCS @laxtiz

Acknowledgements
