A Python package to parse PDBx files into Pandas DataFrames.
pdbx2df
Parse a PDBx file (mmCIF file: pdb_id.cif) into a Python dict with PDBx category names as keys and the contents of each category as the corresponding values. Each category is parsed into a Pandas DataFrame whose columns are the attribute names. Conversely, a dict of Pandas DataFrames can be written out in PDBx format: the dict keys are used as category names, the DataFrame column names as attribute names, and the DataFrame rows as the corresponding records.
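To make the mapping between a category and PDBx text concrete, here is a minimal sketch that renders one category as an mmCIF loop_ block. It is not pdbx2df's writer: format_category is a hypothetical helper, and it takes a plain column-to-values dict (the shape that DataFrame.to_dict(orient='list') produces) so that it runs without Pandas. It also assumes no value contains whitespace, so no quoting is done.

```python
def format_category(name, columns):
    """Render one category as mmCIF loop_ text.

    name:    category name, e.g. '_atom_site' (hypothetical helper, not pdbx2df API)
    columns: dict mapping attribute name -> list of values, all the same length
    """
    attrs = list(columns)
    n_rows = len(columns[attrs[0]]) if attrs else 0
    lines = ["loop_"]
    # One 'category.attribute' line per column.
    lines += [f"{name}.{attr}" for attr in attrs]
    # One whitespace-separated line per record (assumes values need no quoting).
    for i in range(n_rows):
        lines.append(" ".join(str(columns[attr][i]) for attr in attrs))
    # A lone '#' is a CIF comment line, commonly used to separate categories.
    lines.append("#")
    return "\n".join(lines)


# Illustrative values only, not taken from a real entry.
atom_site = {
    "group_PDB": ["ATOM", "ATOM"],
    "id": [1, 2],
    "label_atom_id": ["N", "CA"],
}
text = format_category("_atom_site", atom_site)
print(text)
```

The dict key becomes the prefix of every attribute line, each column contributes one header line, and each row contributes one record line, which mirrors the key/column/row correspondence described above.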
pdbx2df also supports parsing a PDB file (pdb_id.pdb) into a Python dict of Pandas DataFrames. Currently, only the lines starting with 'ATOM', 'HETATM', and 'TER' are read; they go into a category named '_atom_site', which corresponds to the category of the same name in an mmCIF file.
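For readers unfamiliar with the PDB format, the sketch below shows how a single 'ATOM' line breaks down into fields using the fixed column positions of the PDB coordinate record. It only illustrates the file format and is not pdbx2df's implementation: parse_atom_line and every key name except record_name (which appears in the examples below) are assumptions, and the sample line is hand-written rather than copied from a real entry.

```python
def parse_atom_line(line):
    """Split one PDB ATOM/HETATM line into fields by fixed column position.

    Hypothetical helper for illustration; the key names are assumptions.
    Slices are 0-based, so PDB columns 31-38 become line[30:38].
    """
    return {
        "record_name": line[0:6].strip(),   # columns 1-6
        "serial": int(line[6:11]),          # columns 7-11
        "atom_name": line[12:16].strip(),   # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain_id": line[21:22].strip(),    # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "element": line[76:78].strip(),     # columns 77-78
    }


# A hand-written sample record, assembled piece by piece so the columns are visible.
sample = (
    "ATOM  "      # record name
    "    1"       # serial
    " "
    " N  "        # atom name
    " "           # alternate location
    "MET"         # residue name
    " "
    "A"           # chain ID
    "  41"        # residue sequence number
    " "           # insertion code
    "   "
    "  -8.901"    # x
    "   4.127"    # y
    "  -0.555"    # z
    "  1.00"      # occupancy
    "  0.00"      # temperature factor
    + " " * 10 +
    " N"          # element
)
atom = parse_atom_line(sample)
print(atom)
```

A list of such dicts can be passed to pandas.DataFrame to obtain a table with one row per atom.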
Requirements
- Pandas (>=1.0)
Install
pip install pdbx2df
Usage examples
- If you want to read the 3D coordinates for PDB 1vii into a Pandas DataFrame, and you have downloaded the 1vii.cif file to your current working directory ./, you can:
from pdbx2df import read_pdbx
pdbx_file = './1vii.cif'
pdbx = read_pdbx(pdbx_file, category_names=['_atom_site'])
atoms_df = pdbx['_atom_site']
# 'atoms_df' is a Pandas DataFrame containing the '_atom_site' category which has the detailed 3D coordinates for each atom.
- If you want to read the FASTA sequence of 1vii, you can:
from pdbx2df import read_pdbx
pdbx_file = './1vii.cif'
pdbx = read_pdbx(pdbx_file, category_names=['_entity_poly'])
fasta_df = pdbx['_entity_poly']
fasta = fasta_df['pdbx_seq_one_letter_code_can'].to_list()[0] # 1vii only has one sequence
# fasta == 'MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF'
- You can read them simultaneously:
from pdbx2df import read_pdbx
pdbx_file = './1vii.cif'
pdbx = read_pdbx(pdbx_file, category_names=['_entity_poly', '_atom_site'])
atoms_df = pdbx['_atom_site']
fasta_df = pdbx['_entity_poly']
Pass a list of category names as category_names, and every listed category that is present in the PDBx file is returned.
- You can parse the whole file by using 'all':
from pdbx2df import read_pdbx
pdbx_file = './1vii.cif'
pdbx = read_pdbx(pdbx_file, category_names=['all'])
atoms_df = pdbx['_atom_site']
fasta_df = pdbx['_entity_poly']
# and more
- Write back to a PDBx file:
from pdbx2df import read_pdbx, write_pdbx
pdbx_file = './1vii.cif'
pdbx = read_pdbx(pdbx_file, category_names=['all'])
keep = ['_atom_site', '_entity_poly'] # suppose we only want to keep the FASTA sequence and 3D coordinates.
pdbx_keep = {k: v for k, v in pdbx.items() if k in keep}
write_pdbx(pdbx_keep, '1vii_save.cif')
- For reading the atomic information in a PDB file 1vii.pdb:
from pdbx2df import read_pdb
pdb_file = './1vii.pdb'
pdb = read_pdb(pdb_file, category_names=['_atom_site']) # We use '_atom_site' here to mirror the mmCIF format
atoms_df = pdb['_atom_site']
# 'atoms_df' is a Pandas DataFrame containing the '_atom_site' category which has the detailed 3D coordinates for each atom.
- Suppose we only want to keep the 'ATOM' records in 5u8l.pdb and drop the hetero atoms:
from pdbx2df import read_pdb, write_pdb
pdb_file = './5u8l.pdb'
pdb = read_pdb(pdb_file, category_names=['_atom_site'])
df = pdb['_atom_site']
df = df[df.record_name == 'ATOM']
pdb['_atom_site'] = df
write_pdb(pdb, '5u8l_nohetero.pdb')
# The '5u8l_nohetero.pdb' file contains only the protein residues.
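Once a sequence string has been read from the '_entity_poly' category as in the examples above, it can be saved as a FASTA record. The sketch below uses only the standard library; to_fasta is a hypothetical helper, not part of pdbx2df. Because multi-line mmCIF values can carry line breaks, and the examples do not say whether they are stripped, the helper removes all whitespace first.

```python
def to_fasta(header, sequence, width=60):
    """Format a sequence as a FASTA record (hypothetical helper, not pdbx2df API)."""
    # Drop any whitespace or line breaks carried over from the mmCIF value.
    sequence = "".join(sequence.split())
    lines = [f">{header}"]
    # Wrap the sequence at a fixed line width.
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# The 1vii sequence quoted in the example above.
fasta = "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
record = to_fasta("1vii", fasta, width=20)
print(record, end="")
```

The record string can then be written to disk with an ordinary open(..., 'w') call.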