Extending pyproteins for bioinformatics tools & services

Project description

pyproteinsext

Documentation

Notebook tests & examples

TO DO

Generic container modules

TO DO

Annotation modules

TO DO

Specific container modules

Multiple Sequence Alignment

Reading a file

import pyproteins.sequence.msa as msaLib
oMsa = msaLib.Msa(fileName="/Users/guillaumelaunay/work/projects/MSA/clustalw.aln")

Accessing sequences

print(oMsa[0])

will display

{'header': 'sp|Q5SJH5|RIMM_THET8',
'sequence':'------MRLVEIGRFGAPYALKGGLRF--RGEP---VVLHLER----'}

Accessing matching sequences

Retrieve sequences from the MSA by specifying a predicate function or a regular expression. The predicate function is applied to each record in turn and is passed a dictionary with the keys index and record. Records matching the lookup are returned in a list, with gaps stripped from their sequences.

Index-based search

def f(d):
    return d["index"] < 10
recordList = oMsa.recordLookup(predicate=f)
print(len(recordList))
print(recordList[0])

will display

10
{'header': 'sp|Q5SJH5|RIMM_THET8', 'sequence': 'MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGEELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVDILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGSIHVDPIPGLFD'}

Header-content-based search

import re

def g(d):
    # Keep records whose header contains "THET"
    return re.search("THET", d["record"]['header'])
recordList = oMsa.recordLookup(predicate=g)
print(len(recordList))
print(recordList[0])

will display

3
{'header': 'sp|Q5SJH5|RIMM_THET8', 'sequence': 'MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGEELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVDILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGSIHVDPIPGLFD'}
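The lookup mechanism described above can be illustrated without the library. The following is only a minimal sketch, assuming records are plain dictionaries with header and sequence keys; record_lookup is a hypothetical helper written for this illustration, not the actual recordLookup implementation.

```python
import re


def record_lookup(records, predicate):
    """Return gap-stripped copies of the records accepted by predicate."""
    matches = []
    for index, record in enumerate(records):
        # The predicate receives both the position and the record itself.
        if predicate({"index": index, "record": record}):
            matches.append({
                "header": record["header"],
                "sequence": record["sequence"].replace("-", ""),
            })
    return matches


# Toy alignment (shortened sequences, for illustration only).
records = [
    {"header": "sp|Q5SJH5|RIMM_THET8", "sequence": "--MRLVEIG-RF"},
    {"header": "tr|B7A7I3|B7A7I3_THEAQ", "sequence": "MAGRLVEIG-RF"},
]
hits = record_lookup(records, lambda d: re.search("THET", d["record"]["header"]))
print(hits)
```

Passing both index and record in one dictionary lets the same entry point serve positional and content-based searches.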

Transformations

The following methods all return a new Alignment object.

Column deletions

Purging gaps

Specify a threshold of gap frequency used to filter out columns. The default value is 0.5.

oMsa.gapPurge(gapRatio=0.5)
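To make the effect of the threshold concrete, here is a minimal sketch working on plain strings. It assumes that a column is removed when its gap frequency is strictly greater than the threshold; whether the library uses a strict or inclusive comparison is not stated here. gap_purge is a hypothetical helper, not the library method.

```python
def gap_purge(sequences, gap_ratio=0.5):
    """Drop alignment columns whose gap frequency exceeds gap_ratio."""
    n_seq = len(sequences)
    n_col = len(sequences[0])
    # Keep the indices of columns that are not too gappy.
    kept = [
        col for col in range(n_col)
        if sum(seq[col] == "-" for seq in sequences) / n_seq <= gap_ratio
    ]
    return ["".join(seq[col] for col in kept) for seq in sequences]


aligned = ["--MKL-V", "A-MKLTV", "--MRL-V", "ACMKL-V"]
purged = gap_purge(aligned)
print(purged)
```

In this toy alignment the second and sixth columns are 75% gaps and are dropped, while the first column (50% gaps) is kept.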

Sequence-based masking

Use any sequence of the alignment as a master to delete every column where that sequence has a gap. The default master sequence is the one at index 0.

oMsa.maskMaster(masterIndex=0)
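As an illustration of the masking idea, the sketch below removes every column in which the master sequence has a gap. It operates on plain strings, and mask_master is a hypothetical stand-in for the library method.

```python
def mask_master(sequences, master_index=0):
    """Keep only the columns where the master sequence is not a gap."""
    master = sequences[master_index]
    kept = [col for col, letter in enumerate(master) if letter != "-"]
    return ["".join(seq[col] for col in kept) for seq in sequences]


aligned = ["--MKL-V", "A-MKLTV", "ACMRL-V"]
masked = mask_master(aligned)
print(masked)
```

After masking, the master sequence contains no gaps, so column positions map directly onto its residue numbering.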

Sequence-based filtering

Sequences within an MSA can be filtered according to their relationship with a master sequence. The predicate function is applied to each sequence in turn and is passed three arguments:

a 3-tuple of statistics (sequence identity, sequence similarity, sequence coverage) of the master with respect to the current sequence
the master sequence object
the current sequence object

The returned object is an MSA of at least one sequence (the master).

An optional named argument, masterIndex, can be passed to use an alternative sequence as the reference.

# Define the predicate: here, a minimal coverage of 85%
def f(stat, iSeq, jSeq):
    return stat[2] > 0.85
bMsa = oMsa.masterFilter(predicate=f)
print(bMsa.fastaDump())

will print,

>sp|Q5SJH5|RIMM_THET8
---------MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGE
ELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVD
ILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGSIHVDPIPGLFD
>tr|H9ZRG5|H9ZRG5_THETH
---------MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGE
ELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVD
ILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGGIHVDPIPGLFD
>tr|E8PJQ1|E8PJQ1_THESS
MGLWHNGLGMRLVEIGRFGAPYALRGGLKFRGEPVVAHLERVYVEGHGWRAVEDLYQVGD
DLVVHLAGVSSRELAEPLVGLRVYAEVEELPPLEEGRYYYFALIGLPVYVGGLKMGEVVD
ILDAGAQDVLVIRGVGERLRDQTERLVPLQAPYVRVEEEGIHVEPIPGLFD
>tr|B7A7I3|B7A7I3_THEAQ
-------MAGRLVEIGRFGAPYALAGGLKFRGEPVVAHLTRIYVEGHGWRAVEDLYQVGE
ELVVHLAGVSTRELAEALVGLRVYAEVADLPPLEEGQYYYFALIGLPVYVEGQKVGEVAD
ILDAGAQDVLVIRGVGERLRDRAERLVPLQAPYVRVEAEGIHVEPIPGLFD
>tr|K7QWL8|K7QWL8_THEOS
---------MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGFRAVEDLYRVGE
VLILHLAGVSTRELAEALVGLRVYAEVEDLPPLEEGQYYYFALVGLPVYVGEEQVGEVAD
ILDAGAQDVLVIRGIGERLRDQRERLVPLQAPYVTVEEGRILVEPIPGLFD
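The sketch below shows one way such a filter could be organised. Everything in it is an assumption made for illustration: identity, similarity and coverage are computed as fractions of the master residues, the similarity groups are an arbitrary toy grouping, and pair_stats and master_filter are hypothetical helpers. The library's actual definitions may differ.

```python
# Toy residue groups used only to illustrate "similarity" (an assumption).
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ"]


def pair_stats(master, other):
    """Return (identity, similarity, coverage) relative to the master residues."""
    master_cols = [i for i, a in enumerate(master) if a != "-"]
    covered = [i for i in master_cols if other[i] != "-"]
    identical = [i for i in covered if master[i] == other[i]]
    similar = [
        i for i in covered
        if master[i] == other[i]
        or any(master[i] in g and other[i] in g for g in SIMILAR_GROUPS)
    ]
    n = len(master_cols)
    return (len(identical) / n, len(similar) / n, len(covered) / n)


def master_filter(sequences, predicate, master_index=0):
    """Keep the master plus every sequence accepted by the predicate."""
    master = sequences[master_index]
    kept = [master]
    for j, seq in enumerate(sequences):
        if j != master_index and predicate(pair_stats(master, seq), master, seq):
            kept.append(seq)
    return kept


aligned = ["MKLVEIGR", "MRLVE---", "MKIVEIGR"]
filtered = master_filter(aligned, lambda stat, i_seq, j_seq: stat[2] > 0.85)
print(filtered)
```

The second sequence covers only five of the eight master residues and is rejected by the 85% coverage predicate.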

Free slicing

TO DO

Meta-data and statistics

oMsa.shape: [sequenceNumber, columnNumber]

PDB container

Protein data container

Load a PDB file

import pyproteinsext.structure.coordinates as PDB
parser = PDB.Parser()
pdbObj = parser.load(file="./1syq.pdb")

Display SEQRES

pdbObj.SEQRES["A"]

A is the chain name.

Aligning SEQRES and valid ATOM records

Create wrapper peptide object

import pyproteins.sequence.peptide as pep
p1 = {'id' : "SEQRES",
    'desc' : 'pdb file fasta translation',
    'seq' : pdbObj.SEQRES["A"]
}
pepSeqRes = pep.Entry(p1)
pepCoor = pep.Entry(pdbObj.chain("A").peptideSeed())

Align "peptide" sequences

import pyproteins.alignment.nw_custom as N
import pyproteins.alignment.scoringFunctions as scoringFunctions
blosum = scoringFunctions.Needle().fScore
nw = N.nw(gapOpen=-10, gapExtend=-0.5, matchScorer=blosum)
# Align the SEQRES peptide with the peptide built from the ATOM records
aliResObj = nw.align(pepSeqRes, pepCoor)
print(aliResObj)
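For readers unfamiliar with global alignment, here is a deliberately simplified Needleman-Wunsch sketch. It is not the nw_custom implementation: it swaps in a linear gap penalty and a plain match/mismatch score instead of the affine gapOpen/gapExtend penalties and substitution-matrix scorer used above, and all names and parameter values are illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (simplified sketch)."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to rebuild the alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# Toy case: the "coordinates" sequence lacks two residues of the "SEQRES" one.
best, ali_seqres, ali_coor = needleman_wunsch("MKLVEIGRF", "MKLVGRF")
print(best)
print(ali_seqres)
print(ali_coor)
```

Gaps in the second aligned string mark residues present in the first sequence but missing from the second, which is the typical situation when SEQRES residues are unresolved in the ATOM records.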

Example of deriving a peptide sequence from a PDB file. In this illustration, the PDB file is 2vkn.

# Parse the PDB file
import pyproteinsext.structure.coordinates as PDB
parser = PDB.Parser()
pdbObj = parser.load(file="path/2ns7.pdb")
pdbObj.SEQRES

You can build a list of tuples, each holding the amino acid one-letter code and its position, with:

import pyproteins.sequence.peptide as pep
AApdb = [(pep.threeToOne(aa.name), int(aa.num)) for aa in pdbObj.byres()]

threeToOne translates a three-letter amino acid code into the corresponding one-letter code.
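The translation itself amounts to a table lookup. The sketch below covers the 20 standard amino acids; three_to_one and its fallback letter "X" for unknown residues are assumptions made for this illustration and do not describe how pep.threeToOne handles non-standard residues.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(name, unknown="X"):
    """Translate a three-letter residue name into its one-letter code."""
    return THREE_TO_ONE.get(name.strip().upper(), unknown)


# Residue names and numbers as they might be read from ATOM records.
residues = [("MET", 1), ("ARG", 2), ("LEU", 3), ("VAL", 4)]
aa_pdb = [(three_to_one(name), num) for name, num in residues]
print(aa_pdb)
```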

Uniprot container

You can access the content of any UniProt entry. The corresponding XML file will be downloaded locally, if needed, into a user-defined cache directory.

import pyproteinsext.uniprot as uniprot
uniprot.proxySetting(https="https://yourproxy:port", http="http://yourproxy:port")

uColl = uniprot.getUniprotCollection()
uColl.setCache(location='/Users/guillaumelaunay/work/data/uniprot')
uniprot.getPfamCollection().setCache(location='/Users/guillaumelaunay/work/data/pfam')

obj=uColl.get("P98160")


print(obj.GO)
print("\n")
print(obj.DI)
print("\n")
print(obj.peptideSeed())
print("\n")
print(obj.fasta)

will print

[GO:0005605:C:basal lamina{ECO:0000501}, ...]

[DI-02288:Schwartz-Jampel syndrome (SJS1) {Rare autosomal recessive disorder
characterized by permanent myotonia (prolonged failure of muscle relaxation)
and skeletal dysplasia, resulting in reduced stature, kyphoscoliosis,
bowing of the diaphyses and irregular epiphyses.},
...]

{'id': 'P98160', 'desc': 'PGBM_HUMAN', 'seq': 'MGWRAAGALLLALLLHGRLLAVTHGLRAYDGLSLPEDIETVTA...}

>P98160 PGBM_HUMAN
MGWRAAGALLLALLLHGRLLAVTHGLRAYDGLSLPED...
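The caching behaviour described for this container (download only if the file is not already in the cache directory) follows a common pattern, sketched below. This is not the library's code: CachedCollection, the one-file-per-accession layout and the injected fetch function are all assumptions made for illustration, and no network access is performed.

```python
import tempfile
from pathlib import Path


class CachedCollection:
    """Fetch a document once, then serve it from a local directory."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable returning the document text
        self._location = None

    def set_cache(self, location):
        self._location = Path(location)
        self._location.mkdir(parents=True, exist_ok=True)

    def get(self, accession):
        path = self._location / f"{accession}.xml"
        if not path.exists():
            # Cache miss: retrieve the document and store it locally.
            path.write_text(self._fetch(accession), encoding="utf-8")
        return path.read_text(encoding="utf-8")


calls = []


def fake_fetch(accession):
    """Stand-in for a remote download; records each call."""
    calls.append(accession)
    return f"<entry accession='{accession}'/>"


coll = CachedCollection(fake_fetch)
coll.set_cache(tempfile.mkdtemp())
first = coll.get("P98160")
second = coll.get("P98160")
print(calls)
```

The second call is served from disk, so the fetch function runs only once.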

HMMR results container

You can pass one or several files to the parse method. For each protein, an hmmrObj entry is created.

import pyproteinsext.hmmrContainer as hm
hmmrContainer = hm.parse('hmmsearch_A.out', 'hmmsearch_B.out')

hmmrObj attributes and properties:

prot : protein name
domain : domain name
hit : dictionary that contains the hmmr hit information, such as score, E-value, alignment positions...
sequence : protein sequence that corresponds to the domain
start : domain start position in the protein
end : domain end position in the protein

for e in hmmrContainer:
    print(e.prot)
    print(e.domain)
    print(e.hit)
    print(e.sequence)
    print(e.start)
    print(e.end)

will print

tr|A0A1Q6ZBW6|A0A1Q6ZBW6_9ARCH
PF08022_full
{'hmmID': 'PF08022_full', 'aliID': 'tr|A0A1Q6ZBW6|A0A1Q6ZBW6_9ARCH', 'header': '1  score: 16.0 bits;  conditional E-value: 0.00014', 'score': '16.0', 'bias': '0.0', 'cEvalue': '0.00014', 'iEvalue': '0.24', 'hmmFrom': '26', 'hmmTo': '102', 'aliFrom': '37', 'aliTo': '99', 'envFrom': '27', 'envTo': '102', 'acc': '0.89', 'hmmStringLetters': 'fkwkpGqhvylsvpsisklllesHPFtiasapekddelslvirarggwtkrlaelaekseaesksklkvlieGPYGa', 'matchString': ' +++pGq+++++vp + ++     P ++  ++  ++e  +vi++ g ++++l e+              ++ GPYG+', 'aliStringLetters': 'HQANPGQFAMVWVPGVDEV-----PMSVLAIHG-KSEAGVVIKKGGPVSTALWEKKVGD--------IFFVRGPYGH', 'hmmSymbolStuff': {'CS': 'XXXXXXXXXXXXXXXXXXX--XXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX--TTSTTSH'}, 'aliSymbolStuff': {'PP': '5789*************88.....******999.******************9976555........69*******6'}}
HQANPGQFAMVWVPGVDEVPMSVLAIHGKSEAGVVIKKGGPVSTALWEKKVGDIFFVRGPYGH
37
99
...
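Note that, in the output above, the numeric fields of the hit dictionary are stored as strings, so they need converting before any comparison. The sketch below works on plain dictionaries shaped like that output; the second record is invented for illustration, and the helpers and the E-value cut-off are assumptions rather than library features.

```python
hits = [
    # Subset of the fields shown in the output above.
    {"aliID": "tr|A0A1Q6ZBW6|A0A1Q6ZBW6_9ARCH", "score": "16.0",
     "iEvalue": "0.24", "aliFrom": "37", "aliTo": "99"},
    # Invented record, for illustration only.
    {"aliID": "hypothetical_protein_2", "score": "85.3",
     "iEvalue": "1e-20", "aliFrom": "5", "aliTo": "120"},
]


def significant(hit, max_ievalue=0.01):
    """True when the independent E-value is at or below the cut-off."""
    return float(hit["iEvalue"]) <= max_ievalue


def aligned_length(hit):
    """Number of protein residues spanned by the alignment (inclusive)."""
    return int(hit["aliTo"]) - int(hit["aliFrom"]) + 1


kept = [h["aliID"] for h in hits if significant(h)]
print(kept)
print(aligned_length(hits[0]))
```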

TMHMM results container

Container that stores TMHMM results. For now, only one TMHMM result file can be passed to parse.

import pyproteinsext.tmhmmContainerFactory as tmhmm
tmhmmContainer = tmhmm.parse('tmhmm.out')

Container is a collection of TMHMM_Obj objects.

TMHMM_Obj attributes and properties:

prot : protein name
prot_length : protein length
nb_helix : number of predicted helices
fragments : list of Fragment objects. Each fragment has the attributes cellular_location, start and end.
topology_seq : protein sequence with 'o' for an outside loop, 'i' for an inside loop and the helix number for helices. WARNING: this does not work if there are more than 9 helices (their numbers take 2 characters instead of 1)

for e in tmhmmContainer:
    print(e.prot)
    print(e.prot_length)
    print(e.nb_helix)
    print(len(e.fragments),"fragments")
    for f in e.fragments:
        print(f.cellular_location, f.start, f.end)

will print

tr|A0A2E0DFV0|A0A2E0DFV0_9EURY
155
4
9 fragments
outside 1 5
TMhelix 6 25
inside 26 53
TMhelix 54 76
outside 77 90
TMhelix 91 109
inside 110 129
TMhelix 130 152
outside 153 155
...
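The relation between fragments and a topology string can be sketched as follows, using the fragments printed above. Fragment is re-created here as a namedtuple and topology_string is a hypothetical helper; both are illustrative assumptions, not the library's classes.

```python
from collections import namedtuple

Fragment = namedtuple("Fragment", "cellular_location start end")

fragments = [
    Fragment("outside", 1, 5), Fragment("TMhelix", 6, 25),
    Fragment("inside", 26, 53), Fragment("TMhelix", 54, 76),
    Fragment("outside", 77, 90), Fragment("TMhelix", 91, 109),
    Fragment("inside", 110, 129), Fragment("TMhelix", 130, 152),
    Fragment("outside", 153, 155),
]


def topology_string(fragments):
    """Build one symbol per residue: 'o', 'i' or the helix number."""
    symbols = {"outside": "o", "inside": "i"}
    parts = []
    helix_number = 0
    for frag in fragments:
        length = frag.end - frag.start + 1
        if frag.cellular_location == "TMhelix":
            helix_number += 1
            symbol = str(helix_number)
        else:
            symbol = symbols[frag.cellular_location]
        parts.append(symbol * length)
    return "".join(parts)


topo = topology_string(fragments)
print(len(topo))
print(topo[:10])
```

With ten or more helices the helix number no longer fits in a single character, so the string length would stop matching the protein length; this is the limitation flagged in the warning above.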

Pfam container

Protein-Protein interaction containers

psicquicData container API

Descriptions of the Psicquic service and related MITAB format can be found here

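import pyproteinsext.psicquic as psq
psqObj = psq.PSICQUIC(offLine=True)
# mitabFile: path to a local MITAB file
psqObj.read(mitabFile)
# f: a user-defined predicate function selecting the records to keep
psqAllInR6 = psqObj.filter(predicate=f)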

mitabObject API

A mitabObject stores a set of psicquic.PSQDATA objects. Each psicquic.PSQDATA object handles one MITAB record. Management of the psicquic.PSQDATA set is done through the pyproteins.container.Core.dnTree container model.

In practice, a mitabObject can be used to

List all interactors

mitabTopologyObject.keys()

Obtain a one-level dictionary of all the partners of a query protein along with their list of psicquic.PSQDATA

mitabTopologyObject["P38801"]

Obtain all the psicquic.PSQDATA of a specific pair of proteins

mitabTopologyObject["P38801"]["P24783"]

[uniprotkb:P24783	uniprotkb:P38801	intact:EBI-5602|uniprotkb:Q05456|uniprotkb:D6W169	intact:EBI-1909|uniprotkb:D3DL32	psi-mi:dbp2_yeast(display_long)|uniprotkb:DBP2(gene name)|psi-mi:DBP2(display_short)|uniprotkb:YNL112W(locus name)|uniprotkb:N1945(orf name)|uniprotkb:DEAD box protein 2(gene name synonym)|uniprotkb:p68-like protein(gene name synonym)	psi-mi:lrp1_yeast(display_long)|uniprotkb:LRP1(gene name)|psi-mi:LRP1(display_short)|uniprotkb:Like an rRNA processing protein 1(gene name synonym)|uniprotkb:rRNA processing protein 47(gene name synonym)|uniprotkb:RRP47(gene name synonym)|uniprotkb:YC1D(gene name synonym)|uniprotkb:Yeast C1D domain-containing protein(gene name synonym)|uniprotkb:YHR081W(locus name)	psi-mi:"MI:0111"(dihydrofolate reductase reconstruction)	Tarassov et al. (2008)	pubmed:18467557|mint:MINT-6673767|imex:IM-14275	taxid:559292(yeast)|taxid:559292(Saccharomyces cerevisiae)	taxid:559292(yeast)|taxid:559292(Saccharomyces cerevisiae)	psi-mi:"MI:0915"(physical association)	psi-mi:"MI:0471"(MINT)	intact:EBI-6319091|imex:IM-14275-472	author score:99.00|intact-miscore:0.37]

The key order does not matter.
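The order-independent access can be pictured as a two-level dictionary in which every record is registered under both orderings of the pair. The sketch below is only an illustration of that idea; PairIndex is a hypothetical class and says nothing about how dnTree is actually implemented.

```python
class PairIndex:
    """Two-level dictionary where index[a][b] and index[b][a] agree."""

    def __init__(self):
        self._data = {}

    def add(self, a, b, record):
        # Register the record under both orderings of the pair.
        self._data.setdefault(a, {}).setdefault(b, []).append(record)
        if a != b:
            self._data.setdefault(b, {}).setdefault(a, []).append(record)

    def keys(self):
        return list(self._data)

    def __getitem__(self, protein):
        return self._data[protein]


index = PairIndex()
index.add("P24783", "P38801", "record-1")  # placeholder for a PSQDATA object
print(sorted(index.keys()))
print(index["P38801"]["P24783"] == index["P24783"]["P38801"])
```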

Database modules

DB FS

A library to build and query large databases using the file system.

Building a database

Fetch the desired gzipped multi-FASTA database file

Split this file into smaller gzip files

python uniprotFastaFS.py ~/tmp/vUniprot --cluster uniprot_sprot_2018_11.fasta.gz

Several node_*.gz multi-FASTA files are created.

Index the node files

Create a node folder for each node file; the file system architecture is used to index their content. The --nodes argument is a regexp specifying the node number(s) to index.

python uniprotFastaFS.py ~/tmp/vUniprot --nodes '*'

All indexing processes are performed in parallel and the resulting subfolder organisations are independent. By independent folder architecture we mean, as illustrated in the example below, that node subfolders (e.g. node_1 and node_2) may contain subfolders with the same prefix (e.g. A, B).

vUniprot
    |____node_1
    | |____A
    | | |____index.txt
    | | |____data.gz
    | |____B
    | | |____index.txt
    | | |____data.gz
    |____node_2
    | |____A
    | | |____index.txt
    | | |____data.gz
    | |____B
    | | |____index.txt
    | | |____data.gz
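Because each node is indexed independently, an entry may sit under the same prefix folder in any node. The sketch below shows how candidate folders could be located in such a layout. It rests on an assumption not stated above, namely that the prefix is taken from the first character(s) of the accession; candidate_folders is a hypothetical helper, and the directory tree is rebuilt in a temporary location.

```python
import tempfile
from pathlib import Path


def candidate_folders(root, accession, prefix_length=1):
    """List the prefix folders, across all nodes, that may hold accession."""
    prefix = accession[:prefix_length]
    return sorted(p for p in Path(root).glob(f"node_*/{prefix}") if p.is_dir())


# Rebuild the example layout in a temporary directory.
root = Path(tempfile.mkdtemp())
for node in ("node_1", "node_2"):
    for prefix in ("A", "B"):
        (root / node / prefix).mkdir(parents=True)

found = candidate_folders(root, "A0A1Q6ZBW6")
print([p.relative_to(root).as_posix() for p in found])
```

A query would then only need to read the index.txt of these few folders instead of scanning the whole database.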

Querying a database

You can fetch a FASTA entry in the following way, optionally dumping it to a file.

python uniprotFastaFS.py ~/tmp/vUniprot --get P98160

Please consult the CLI help for additional options.

Project details

Release history

This version

3.1.7

Jun 9, 2026

3.1.6

Dec 13, 2024

3.1.5

Dec 13, 2024

3.1.4

Dec 13, 2024

3.1.3

Oct 24, 2023

3.1.2

May 2, 2023

3.1.1

Mar 22, 2023

3.1.0

Feb 16, 2023

3.0.3

Feb 11, 2022

3.0.1

Feb 11, 2022

3.0.0

Feb 10, 2022

2.3.4

May 18, 2021

2.3.3

May 7, 2021

2.3.2

May 7, 2021

2.3.1

May 7, 2021

2.3

May 7, 2021

2.2

May 7, 2021

2.1

Apr 30, 2021

1.9

Apr 30, 2021

1.8

Mar 17, 2020

1.7

Feb 6, 2020

1.6

Jul 26, 2019

1.5

May 2, 2019

1.3

Feb 21, 2019

1.2

Feb 21, 2019

1.1

Feb 21, 2019

1.0

Jan 31, 2019

0.38

Jul 18, 2017

0.37

Mar 28, 2017

0.36

Mar 28, 2017

0.35

Mar 27, 2017

0.34

Mar 14, 2017

0.31

Mar 13, 2017

0.30

Mar 13, 2017

0.26

Sep 1, 2016

0.25

Sep 1, 2016

0.24

Sep 1, 2016

0.23

Sep 1, 2016

0.22

Jul 19, 2016

0.21

Jun 17, 2016

0.4

Sep 26, 2017

Download files

Source Distribution

pyproteinsext-3.1.7.tar.gz (2.3 MB)

Uploaded Jun 9, 2026 Source

Built Distribution

pyproteinsext-3.1.7-py3-none-any.whl (2.3 MB)

Uploaded Jun 9, 2026 Python 3

File details

Details for the file pyproteinsext-3.1.7.tar.gz.

File metadata

Download URL: pyproteinsext-3.1.7.tar.gz
Upload date: Jun 9, 2026
Size: 2.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.3

File hashes

Hashes for pyproteinsext-3.1.7.tar.gz
Algorithm	Hash digest
SHA256	`c026489d8055dada59d8df2d49686abff42fe5bd7d9a4ac704587b0cab409b8c`
MD5	`06baef9f5a73b8aae7704553246eac9c`
BLAKE2b-256	`321bb488ac9b9cd8316af112aac22adaec693b490bf3d5aaa89fe5e6b3e13f42`

File details

Details for the file pyproteinsext-3.1.7-py3-none-any.whl.

File metadata

Download URL: pyproteinsext-3.1.7-py3-none-any.whl
Upload date: Jun 9, 2026
Size: 2.3 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.3

File hashes

Hashes for pyproteinsext-3.1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bbc789eef1d5d1667609aa9c46651d795c761ff6ff6c878dd6c91256782058f8`
MD5	`e23855742ae5bea0eb7c1c2d8c9b889e`
BLAKE2b-256	`046824369843af9de5e610ca0928549018b1ef34e86b7dffc1aa37306d870d90`
