Tools for analyzing biological sequences

Project description

BioSequences


A package for analyzing nucleic acid and peptide sequences.

Build from source

python setup.py build_ext --inplace
rm -r ./build

Install with pip

pip install biosequences

Main features

bioseq.Sequence

bioseq.Sequence.Sequence(seq="")

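  • RNA, DNA, and Peptide are all based on this abstract class, so the attributes and methods of Sequence are shared by all sequence objects.
  • Sequence objects of the same type can be concatenated and compared directly with each other or with plain strings.
  • No object validates seq, so make sure seq contains no invalid characters when constructing an object; otherwise unexpected problems may occur.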

from bioseq.sequence import DNA, RNA, Peptide

d1 = DNA("ATCC")
d2 = DNA("AC")
r1 = Peptide("MATN")

d1  # 5'-ATCC-3'
r1  # N-MATN-C
d1 + d2  # 5'-ATCCAC-3'
d2 + d1  # 5'-ACATCC-3'
d1 + r1  # raises TypeError (note: a DNA can be concatenated with an RNA, and no T->U conversion is performed)
d1 == d2  # False

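Attributes

seq

The sequence itself; read-only (the actual data is stored in the internal attribute _Seq).

length

The length of the sequence.

weight

The molecular weight of the sequence.

composition

The content of each unit (residue) in the sequence.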
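
As an illustration of what length and composition describe, here is a minimal sketch in plain Python. It does not use the bioseq package; the helper name composition and the choice to report fractions rather than raw counts are assumptions, since the text above does not say which form the library returns.

```python
from collections import Counter


def composition(seq: str) -> dict:
    """Return the fraction of each unit in seq (hypothetical helper, not the library API)."""
    total = len(seq)
    if total == 0:
        return {}
    counts = Counter(seq)  # number of occurrences of each character
    return {unit: n / total for unit, n in counts.items()}


seq = "ATCC"
print(len(seq))          # sequence length
print(composition(seq))  # fraction of A, T and C in the sequence
```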

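Methods

align(subject, mode=1, return_score=False)

subject(str | Sequence): the sequence to align against
mode(int):
  1 - global alignment using Needleman-Wunsch
  2 - local alignment using Smith-Waterman
return_score(bool): whether to also return the alignment score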
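
For readers unfamiliar with mode 1, the following is a minimal sketch of Needleman-Wunsch global alignment in plain Python. It is not the library's implementation: the function name needleman_wunsch is hypothetical, and it assumes a single linear gap penalty instead of separate gap-open and gap-extend scores.

```python
def needleman_wunsch(a, b, match=2, mismatch=-3, gap=-3):
    """Globally align a and b; returns (aligned_a, aligned_b, score). Illustrative only."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("ATCTCGC", "ATCCC")
print(aligned_a)
print(aligned_b)
print(total)
```

Smith-Waterman (mode 2) uses the same recurrence but clamps every cell at zero and starts the traceback from the highest-scoring cell.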

find(target)

Searches the sequence for the target and returns the start positions of all matches.

target(str | Sequence): the sequence to search for
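
The behavior described for find can be sketched in plain Python as follows. The helper find_all is hypothetical, and two details are assumptions because the text does not state them: positions are 0-based, and overlapping matches are reported.

```python
def find_all(seq: str, target: str) -> list:
    """Return the start index of every occurrence of target in seq (overlaps included)."""
    if not target:
        return []
    positions = []
    start = seq.find(target)
    while start != -1:
        positions.append(start)
        # Resume one character later so that overlapping matches are found too
        start = seq.find(target, start + 1)
    return positions


print(find_all("ATATAT", "ATA"))  # [0, 2]
```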

mutation(position, target)

Modifies the sequence.

position(str | int | List[int]): the start position(s) to modify, or the substring to be replaced
target(str | Sequence): the target sequence

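bioseq.sequence.RNA

Stores RNA sequence information.

Attributes

revered
The reversed RNA sequence.
complemented
The reverse-complement RNA sequence.
GC
The GC content of the sequence.
orf
The open reading frames in the sequence; available only after getOrf() has been called.
peptide
The translation product of the sequence; available only after transcript() has been called.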
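
The first three attributes correspond to simple string operations. The sketch below shows them on plain strings; the function names are hypothetical, and reporting GC content as a fraction between 0 and 1 (rather than a percentage) is an assumption.

```python
# Translation table mapping each RNA base to its complement
RNA_COMPLEMENT = str.maketrans("AUCG", "UAGC")


def reverse(seq: str) -> str:
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    # Complement every base, then read the result backwards
    return seq.translate(RNA_COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


print(reverse("AUGC"))             # CGUA
print(reverse_complement("AUGC"))  # GCAU
print(gc_content("AUGC"))          # 0.5
```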

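Methods

revers()

Reverses the sequence in place. Note: this modifies the object itself.

complemented()

Replaces the sequence with its reverse complement in place. Note: this modifies the object itself.

getOrf(multi=False, replace=False)

Finds the ORFs in the sequence.

multi(bool): whether to search for all ORFs in frames +1 to +3; if False, only the longest ORF is searched for
replace(bool): takes effect only when multi=False; whether to replace the original sequence with the longest ORF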

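transcript(filter=True)

Translates the sequence into peptide chains.

filter(bool): whether to filter the translation products. If True, only the longest product is returned; otherwise all products are returned. Every product is a Peptide object.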
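
To make the translation step concrete, here is a minimal sketch in plain Python that scans an RNA string for start codons and translates each reading frame up to the first stop codon, using the standard genetic code. The function translate_orfs is hypothetical, and the rule that a frame without a stop codon yields no product is an assumption, chosen because it agrees with the setStartCoden example later in this document.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; "*" marks a stop codon
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_orfs(rna: str, start_codons=("AUG",)) -> list:
    """Translate every start-to-stop reading frame in rna (expects only A, U, C, G)."""
    peptides = []
    for i in range(len(rna) - 2):
        if rna[i:i + 3] not in start_codons:
            continue
        residues = []
        for j in range(i, len(rna) - 2, 3):
            aa = CODON_TABLE[rna[j:j + 3]]
            if aa == "*":
                # A stop codon closes the frame; only closed frames are kept
                peptides.append("".join(residues))
                break
            residues.append(aa)
    return peptides


products = translate_orfs("AUCAUCUCAGCAUGAC", start_codons=("AUC",))
print(products)                # ['IISA', 'ISA']
print(max(products, key=len))  # the longest product, analogous to filter=True
```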

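bioseq.sequence.DNA

Stores DNA sequence information.

Methods

translate()

Converts the DNA into an RNA object and returns it (despite the method name, this step is transcription in biological terms).

transcript(filter=True)

Translates the sequence into peptide chains.

filter(bool): whether to filter the translation products. If True, only the longest product is returned; otherwise all products are returned. Every product is a Peptide object.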

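bioseq.sequence.Peptide

Stores peptide chain sequence information.

Attributes

pl

The isoelectric point of the peptide, calculated from the amino acid pK values in the EMBOSS database.

Methods

chargeInpH(pH)

Calculates the charge carried by the peptide at a given pH, based on the amino acid pK values in the EMBOSS database.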
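
One common way to compute such a charge is to sum Henderson-Hasselbalch terms over the ionizable groups, and the isoelectric point is then the pH at which that sum is zero. The sketch below follows this approach in plain Python. The function names are hypothetical, and the pK values are illustrative placeholders that should be checked against the EMBOSS data file before use; nothing here is confirmed to match the library's implementation.

```python
# Illustrative pK values (assumption: verify against the EMBOSS pK table)
POSITIVE_PK = {"N_term": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PK = {"C_term": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def charge_at_ph(peptide: str, ph: float) -> float:
    """Net charge of a peptide at the given pH (Henderson-Hasselbalch sum)."""
    pos_counts = {"N_term": 1, **{aa: peptide.count(aa) for aa in "KRH"}}
    neg_counts = {"C_term": 1, **{aa: peptide.count(aa) for aa in "DECY"}}
    charge = 0.0
    for group, n in pos_counts.items():
        charge += n / (1 + 10 ** (ph - POSITIVE_PK[group]))
    for group, n in neg_counts.items():
        charge -= n / (1 + 10 ** (NEGATIVE_PK[group] - ph))
    return charge


def isoelectric_point(peptide: str, tol: float = 1e-4) -> float:
    """Find the pH of zero net charge by bisection (charge falls as pH rises)."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if charge_at_ph(peptide, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(charge_at_ph("MATN", 7.0))
print(isoelectric_point("MATN"))
```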

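getHphob(window_size=9)

Calculates the hydrophobicity of the peptide from the amino acid hydropathy data of Kyte and Doolittle (1982).

window_size(int): the hydrophobicity of a residue is the mean hydrophobicity of all residues in a window of window_size residues centered on that residue
The return value is a list with the hydrophobicity of each residue, which can be plotted directly with plt.plot(result)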
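
The sliding-window average can be sketched in plain Python as follows, using the Kyte-Doolittle hydropathy scale. The function hydropathy_profile is hypothetical, and shrinking the window at the ends of the chain is an assumption; the library may treat the edges differently.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(peptide: str, window_size: int = 9) -> list:
    """Mean hydropathy in a window centered on each residue (window truncated at the ends)."""
    half = window_size // 2
    values = [KYTE_DOOLITTLE[aa] for aa in peptide]
    profile = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        profile.append(sum(window) / len(window))
    return profile


profile = hydropathy_profile("MATNLLVIAGR", window_size=3)
print(len(profile))  # one value per residue
```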

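bioseq.config

Configuration data can be edited directly in this file, or partly changed at runtime through the following functions.

setAlignPara(match=2, mismatch=-3, gap_open=-3, gap_extend=-3)

Changes the scoring rules used for sequence alignment; must be called before aligning.

match(int): match score, > 0
mismatch(int): mismatch score, < 0
gap_open(int): gap opening score, < 0
gap_extend(int): gap extension score, < 0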

from bioseq.sequence import DNA
from bioseq.config import setAlignPara

d1 = DNA("ATCTCGC")
d2 = DNA("ATCCC")

print(d1.align(d2, return_score=True))  # ('ATCTCGC', 'ATC-C-C', 4.0)
setAlignPara(5)  # raise the match score to 5
print(d1.align(d2, return_score=True))  # ('ATCTCGC', 'A--TCCC', -0.5)
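
To show how the four parameters combine into a score, the sketch below scores an already aligned pair in plain Python. The function score_alignment is hypothetical. It assumes that the first position of a gap costs gap_open and each further position costs gap_extend; this reproduces the 4.0 reported for the first call above, but it is not confirmed to be the rule the library uses.

```python
def score_alignment(a: str, b: str, match=2, mismatch=-3, gap_open=-3, gap_extend=-3):
    """Score two gapped strings of equal length under the assumed gap convention."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # First gap position pays gap_open, later ones pay gap_extend
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


print(score_alignment("ATCTCGC", "ATC-C-C"))  # 4
```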

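setStartCoden(coden)

Changes the start codons used when translating nucleic acid sequences.

coden(str | List[str]): translation starts wherever a codon in the sequence matches one of the codons listed in coden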

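from bioseq.sequence import DNA
from bioseq.config import setStartCoden

d1 = DNA("ATCATCTCAGCATGAC")

print(d1.transcript(filter=False))  # []
setStartCoden(["AUC"])  # treat AUC as a start codon
print(d1.transcript(filter=False))  # [N-IISA-C, N-ISA-C]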

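bioseq.utils

Utilities

printAlign(sequence1, sequence2, spacing=10, line_width=30, show_seq=True)

Prints two aligned sequences to the command line in a formatted layout. The symbols used can be changed in config.SYMBOL.

spacing(int): display interval; a space is inserted after every spacing characters
line_width(int): number of characters shown per line
show_seq(bool): whether to display the sequences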

from bioseq.sequence import DNA
from bioseq.utils import printAlign

d1 = DNA("ATCATCTCAGCATGAC")
d2 = DNA("ATCATCGCATGAC")

seq1, seq2 = d1.align(d2)
printAlign(d1, d2)
#    1 ATCATCTCAG CAT
#      ┃┃┃┃┃┃•┃┃• •┃•
#    1 ATCATCGCAT GAC
printAlign(d1, d2, spacing=3, line_width=10, show_seq=False)
#    1 ┃┃┃ ┃┃┃ •┃┃ •
#
#   11 •┃•

read_fasta(filename)

Reads a FASTA file and returns everything read as a tuple (list of sequences, list of sequence names). Todo: support more file formats.
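
A FASTA parser with the same return shape can be sketched in a few lines of plain Python. The function parse_fasta_lines is hypothetical and works on any iterable of lines (such as an open file); using the full header text after ">" as the sequence name is an assumption.

```python
def parse_fasta_lines(lines):
    """Parse FASTA-formatted lines into (sequences, names), keeping file order."""
    names, sequences, chunks = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if names:
                # Close the previous record before starting a new one
                sequences.append("".join(chunks))
            names.append(line[1:].strip())
            chunks = []
        elif names:
            chunks.append(line)
    if names:
        sequences.append("".join(chunks))
    return sequences, names


example = [">seq1 first record", "ATGC", "CCGA", "", ">seq2", "GGTT"]
sequences, names = parse_fasta_lines(example)
print(names)      # ['seq1 first record', 'seq2']
print(sequences)  # ['ATGCCCGA', 'GGTT']
```

With a real file, the same function can be called as parse_fasta_lines(open(path)) inside a with block.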

Download files

Source Distribution

BioSequences-1.0.6.tar.gz (15.7 kB)

Built Distributions

BioSequences-1.0.6-cp38-cp38-win_amd64.whl (19.3 kB), CPython 3.8, Windows x86-64
BioSequences-1.0.6-cp38-cp38-manylinux2014_x86_64.whl (32.5 kB), CPython 3.8
