mmCIF parser written in Nim with Python bindings

nim-mmcif

Fast mmCIF (Macromolecular Crystallographic Information File) parser written in Nim with Python bindings

The goal of this repository is to experiment with vibe coding while building something useful for the bioinformatics community, and to see how much of a cross-platform library can be driven to completion by transformers.

Verdict: I have upgraded to the $200 Max plan. Opus is the only viable model, at least for me, and can be treated as a superhumanly fast but imperfect junior developer. With the right prompting, it can automate a lot of boring work and let me focus on the high-level creative work.

Features

  • 🚀 High-performance parsing of mmCIF files using Nim
  • 🌍 Cross-platform support (Linux, macOS, Windows)
  • 📦 Easy installation via pip

Installation

Prerequisites

From PyPI

pip install nim-mmcif

From Source

# Install Nim (platform-specific, see below)
# macOS: brew install nim
# Linux: curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Windows: scoop install nim

# Install the package
git clone https://github.com/lucidrains/nim-mmcif
cd nim-mmcif
pip install -e .

For detailed platform-specific instructions, see CROSS_PLATFORM.md.

Quick Start

Python Usage

Dictionary Access

from nim_mmcif import parse_mmcif

# Parse an mmCIF file (returns dict by default)
data = parse_mmcif("tests/test.mmcif")
print(f"Found {len(data['atoms'])} atoms")

# Access atom properties using dictionary notation
first_atom = data['atoms'][0]
print(f"Atom {first_atom['id']}: {first_atom['label_atom_id']}")
print(f"Position: ({first_atom['x']}, {first_atom['y']}, {first_atom['z']})")

# Parse multiple files using glob patterns
results = parse_mmcif("tests/*.mmcif")
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")
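
Once the atoms are available as plain dictionaries, ordinary Python is enough for simple geometry. The sketch below computes an unweighted centroid. It runs on a small hand-written stand-in for the parsed result, so the sample values (and the assumption that `x`, `y`, `z` are floats) are illustrative only; `centroid` is a hypothetical helper, not part of nim-mmcif.

```python
from statistics import fmean

# Hypothetical stand-in for the dict returned by parse_mmcif();
# only the documented keys 'atoms', 'id', 'label_atom_id', 'x', 'y', 'z' are used.
data = {
    "atoms": [
        {"id": 1, "label_atom_id": "N", "x": 0.0, "y": 0.0, "z": 0.0},
        {"id": 2, "label_atom_id": "CA", "x": 2.0, "y": 0.0, "z": 0.0},
        {"id": 3, "label_atom_id": "C", "x": 2.0, "y": 4.0, "z": 6.0},
    ]
}


def centroid(atoms):
    """Return the unweighted mean (x, y, z) of a non-empty list of atom dicts."""
    return tuple(fmean(atom[axis] for atom in atoms) for axis in ("x", "y", "z"))


center = centroid(data["atoms"])
print(f"Centroid: ({center[0]:.3f}, {center[1]:.3f}, {center[2]:.3f})")
```

With a real file, `data` would come from `parse_mmcif(...)` instead of the literal above.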

Dataclass Access

from nim_mmcif import parse_mmcif, parse_mmcif_batch

# Parse with dataclass support for cleaner dot notation access
data = parse_mmcif("tests/test.mmcif", as_dataclass=True)
print(f"Found {data.atom_count} atoms")

# Access atom properties using dot notation
first_atom = data.atoms[0]
print(f"Atom {first_atom.id}: {first_atom.label_atom_id}")
print(f"Position: ({first_atom.x}, {first_atom.y}, {first_atom.z})")
print(f"Chain: {first_atom.label_asym_id}, Residue: {first_atom.label_comp_id}")

# Use convenience properties and methods
print(f"Unique chains: {data.chains}")
print(f"Number of residues: {len(data.residues)}")

# Get all atoms from a specific chain
chain_a_atoms = data.get_chain('A')

# Get all atoms from a specific residue
residue_atoms = data.get_residue('A', 1)

# Get all positions as tuples
positions = data.positions  # List of (x, y, z) tuples

# Batch processing with dataclasses
results = parse_mmcif_batch(["tests/test1.mmcif", "tests/test2.mmcif"], as_dataclass=True)
for result in results:
    print(f"Structure has {result.atom_count} atoms in {len(result.chains)} chain(s)")

Other Functions

import nim_mmcif

# Get atom count directly
count = nim_mmcif.get_atom_count("tests/test.mmcif")
print(f"File contains {count} atoms")

# Get all atoms with their properties (returns list of dicts)
atoms = nim_mmcif.get_atoms("tests/test.mmcif")
for atom in atoms[:5]:  # Print first 5 atoms
    print(f"Atom {atom['id']}: {atom['label_atom_id']} at ({atom['x']}, {atom['y']}, {atom['z']})")

# Get just the 3D coordinates
positions = nim_mmcif.get_atom_positions("tests/test.mmcif")
for i, (x, y, z) in enumerate(positions[:5]):
    print(f"Position {i}: ({x:.3f}, {y:.3f}, {z:.3f})")

Nim Usage

First

$ nimble install nim_mmcif

Then

import nim_mmcif

# Parse an mmCIF file
let data = mmcif_parse("tests/test.mmcif")
echo "Found ", data.atoms.len, " atoms"

# Iterate through atoms
for atom in data.atoms[0..<min(5, data.atoms.len)]:
  echo "Atom ", atom.id, ": ", atom.label_atom_id, 
       " at (", atom.Cartn_x, ", ", atom.Cartn_y, ", ", atom.Cartn_z, ")"

# Access specific atom properties
if data.atoms.len > 0:
  let firstAtom = data.atoms[0]
  echo "Chain: ", firstAtom.label_asym_id
  echo "Residue: ", firstAtom.label_comp_id
  echo "B-factor: ", firstAtom.B_iso_or_equiv

Batch Processing

Process multiple mmCIF files efficiently in a single operation:

import nim_mmcif

# List of mmCIF files to process
files = [
    "path/to/structure1.mmcif",
    "path/to/structure2.mmcif",
    "path/to/structure3.mmcif"
]

# Parse all files in batch (returns list when no globs used)
results = nim_mmcif.parse_mmcif_batch(files)

# Process results
for i, data in enumerate(results):
    print(f"Structure {i+1}: {len(data['atoms'])} atoms")
    
    # Analyze each structure
    atoms = data['atoms']
    if atoms:
        # Get unique chain IDs
        chains = set(atom['label_asym_id'] for atom in atoms)
        print(f"  Chains: {', '.join(sorted(chains))}")
        
        # Count residues
        residues = set((atom['label_asym_id'], atom['label_seq_id']) 
                      for atom in atoms)
        print(f"  Residues: {len(residues)}")

# Batch processing with glob patterns (returns dict)
results = nim_mmcif.parse_mmcif_batch("path/to/*.mmcif")
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

# Mix of glob patterns and regular paths (returns dict)
results = nim_mmcif.parse_mmcif_batch([
    "specific_file.mmcif",
    "structures/*.mmcif",
    "models/model_?.mmcif"
])
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

Batch processing is particularly useful when:

  • Analyzing multiple protein structures for comparative studies
  • Processing entire datasets of crystallographic structures
  • Building machine learning datasets from PDB structures in mmCIF format
  • Performing high-throughput structural analysis

The batch function provides better performance than individual parsing when processing multiple files, as it reduces the overhead of repeated function calls.
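
Because the batch call returns a list for plain paths and a dict when glob patterns are involved, downstream code may want a single shape. The helper below is a minimal sketch of one way to normalize both cases into a path-to-data mapping. `as_path_mapping` is a hypothetical name, and the list branch assumes results come back in the same order as the input paths, which is not confirmed here.

```python
def as_path_mapping(filepaths, results):
    """Return {path: parsed_data} whether `results` is a list or a dict.

    Assumption: list results are ordered like `filepaths`.
    """
    if isinstance(results, dict):
        # Glob case: results are already keyed by file path.
        return dict(results)
    if isinstance(filepaths, str):
        filepaths = [filepaths]
    if len(filepaths) != len(results):
        raise ValueError("expected exactly one result per input path")
    return dict(zip(filepaths, results))


# Hypothetical results standing in for parse_mmcif_batch() output.
list_style = as_path_mapping(["a.mmcif", "b.mmcif"], [{"atoms": []}, {"atoms": [{}]}])
dict_style = as_path_mapping("dir/*.mmcif", {"dir/c.mmcif": {"atoms": []}})

for path, parsed in {**list_style, **dict_style}.items():
    print(f"{path}: {len(parsed['atoms'])} atoms")
```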

API Reference

Functions

parse_mmcif(filepath: str, as_dataclass: bool = False) -> dict | MmcifData | dict[str, dict] | dict[str, MmcifData]

Parse an mmCIF file or files matching a glob pattern.

  • filepath: Path to mmCIF file or glob pattern
  • as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
  • Returns:
    • Single file + dict: Dictionary with 'atoms' key
    • Single file + dataclass: MmcifData instance
    • Glob pattern + dict: Dictionary mapping file paths to parsed data
    • Glob pattern + dataclass: Dictionary mapping file paths to MmcifData instances
  • Supports wildcards: * (any characters), ? (single character), ** (recursive)
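
To preview which files a pattern would pick up before parsing, the standard library's `glob` module can be used. This sketch assumes the wildcard semantics listed above match Python's `glob` (with `recursive=True` for `**`), which is an assumption rather than something stated by the library. `preview_matches` is a hypothetical helper, and the files are empty placeholders created in a temporary directory.

```python
import glob
import os
import tempfile


def preview_matches(pattern):
    """List the paths Python's glob module matches for `pattern`, sorted."""
    return sorted(glob.glob(pattern, recursive=True))


with tempfile.TemporaryDirectory() as root:
    # Create a tiny directory tree of empty placeholder files.
    os.makedirs(os.path.join(root, "models", "nested"))
    for rel in ("models/model_1.mmcif", "models/model_10.mmcif", "models/nested/deep.mmcif"):
        open(os.path.join(root, *rel.split("/")), "w").close()

    star = preview_matches(os.path.join(root, "models", "*.mmcif"))
    single = preview_matches(os.path.join(root, "models", "model_?.mmcif"))
    recursive = preview_matches(os.path.join(root, "**", "*.mmcif"))
    counts = (len(star), len(single), len(recursive))

print(counts)
```

Here `*` matches both files directly under `models`, `?` matches only `model_1.mmcif` because it stands for exactly one character, and `**` also descends into `nested`.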

parse_mmcif_batch(filepaths: list[str] | str, as_dataclass: bool = False) -> list[dict] | list[MmcifData] | dict[str, dict] | dict[str, MmcifData]

Parse multiple mmCIF files in a single operation.

  • filepaths: List of paths, single path, or glob pattern
  • as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
  • Returns:
    • No glob + dict: List of dictionaries with parsed data
    • No glob + dataclass: List of MmcifData instances
    • With glob + dict: Dictionary mapping file paths to parsed data
    • With glob + dataclass: Dictionary mapping file paths to MmcifData instances
  • More efficient than parsing files individually when processing multiple structures

get_atom_count(filepath: str) -> int

Get the number of atoms in an mmCIF file.

get_atoms(filepath: str) -> list[dict]

Get all atoms from an mmCIF file as a list of dictionaries.

get_atom_positions(filepath: str) -> list[tuple[float, float, float]]

Get 3D coordinates of all atoms as a list of (x, y, z) tuples.
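
A list of (x, y, z) tuples is convenient for quick geometric checks. As an illustration, the sketch below computes an axis-aligned bounding box and its diagonal with `math.dist` (Python 3.8+). The coordinates are made-up sample values in place of `get_atom_positions(...)` output, and `bounding_box` is a hypothetical helper.

```python
import math

# Made-up coordinates standing in for get_atom_positions() output.
positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]


def bounding_box(points):
    """Return (lower_corner, upper_corner) of a non-empty list of 3D points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


lower, upper = bounding_box(positions)
diagonal = math.dist(lower, upper)
first_bond = math.dist(positions[0], positions[1])
print(f"Box from {lower} to {upper}, diagonal {diagonal:.3f}")
```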

Dataclasses

MmcifData

Container for parsed mmCIF data with typed atom access.

Properties:

  • atoms: List of Atom objects
  • atom_count: Total number of atoms
  • positions: List of (x, y, z) tuples for all atoms
  • chains: Set of unique chain identifiers
  • residues: Set of unique (chain_id, seq_id) tuples

Methods:

  • get_chain(chain_id: str): Get all atoms from a specific chain
  • get_residue(chain_id: str, seq_id: int): Get all atoms from a specific residue
  • to_dict(): Convert back to dictionary format
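
To make the relationship between these properties and methods concrete, here is a minimal stand-in showing how they could be derived from the atom list. It is not the library's implementation: `AtomSketch` and `StructureSketch` are hypothetical classes that only mirror the documented names, and the integer type of `label_seq_id` is assumed.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class AtomSketch:
    """Hypothetical, reduced atom with only the fields needed here."""
    label_asym_id: str
    label_seq_id: int
    label_atom_id: str


@dataclass
class StructureSketch:
    """Hypothetical container mirroring the documented MmcifData surface."""
    atoms: List[AtomSketch] = field(default_factory=list)

    @property
    def atom_count(self) -> int:
        return len(self.atoms)

    @property
    def chains(self) -> Set[str]:
        return {a.label_asym_id for a in self.atoms}

    @property
    def residues(self) -> Set[Tuple[str, int]]:
        return {(a.label_asym_id, a.label_seq_id) for a in self.atoms}

    def get_chain(self, chain_id: str) -> List[AtomSketch]:
        return [a for a in self.atoms if a.label_asym_id == chain_id]

    def get_residue(self, chain_id: str, seq_id: int) -> List[AtomSketch]:
        return [a for a in self.get_chain(chain_id) if a.label_seq_id == seq_id]


structure = StructureSketch([
    AtomSketch("A", 1, "N"),
    AtomSketch("A", 1, "CA"),
    AtomSketch("A", 2, "N"),
    AtomSketch("B", 1, "N"),
])
print(sorted(structure.chains), len(structure.residues))
```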

Atom

Represents a single atom with typed properties accessible via dot notation.

Properties:

  • type: Record type (ATOM or HETATM)
  • id: Atom serial number
  • type_symbol: Element symbol
  • label_atom_id: Atom name
  • label_comp_id: Residue name
  • label_asym_id: Chain identifier
  • label_entity_id: Entity ID
  • label_seq_id: Residue sequence number
  • Cartn_x, Cartn_y, Cartn_z: 3D coordinates
  • x, y, z: Convenient aliases for coordinates
  • occupancy: Occupancy factor
  • B_iso_or_equiv: B-factor (temperature factor)
  • position: Tuple of (x, y, z) coordinates

Methods:

  • to_dict(): Convert back to dictionary format

Dictionary Format

When using the default dictionary format (as_dataclass=False), each atom dictionary contains:

  • type: Record type (ATOM or HETATM)
  • id: Atom serial number
  • label_atom_id: Atom name
  • label_comp_id: Residue name
  • label_asym_id: Chain identifier
  • label_seq_id: Residue sequence number
  • x, y, z: 3D coordinates (aliases for Cartn_x, Cartn_y, Cartn_z)
  • occupancy: Occupancy factor
  • B_iso_or_equiv: B-factor
  • And more...
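
The keys above are enough for simple per-structure summaries. The sketch below counts ATOM and HETATM records and averages the B-factor per chain. The atom list is hand-written sample data, and it assumes `B_iso_or_equiv` is already numeric in the dictionary output, which should be checked against real results.

```python
from collections import Counter, defaultdict
from statistics import fmean

# Hand-written sample atoms using only documented keys; values are invented.
atoms = [
    {"type": "ATOM", "label_asym_id": "A", "B_iso_or_equiv": 10.0},
    {"type": "ATOM", "label_asym_id": "A", "B_iso_or_equiv": 20.0},
    {"type": "ATOM", "label_asym_id": "B", "B_iso_or_equiv": 30.0},
    {"type": "HETATM", "label_asym_id": "B", "B_iso_or_equiv": 50.0},
]

# Count how many records of each type are present.
record_counts = Counter(atom["type"] for atom in atoms)

# Collect B-factors per chain, then average them.
b_by_chain = defaultdict(list)
for atom in atoms:
    b_by_chain[atom["label_asym_id"]].append(atom["B_iso_or_equiv"])
mean_b = {chain: fmean(values) for chain, values in b_by_chain.items()}

print(dict(record_counts))
print(mean_b)
```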

Platform Support

Platform | Architecture | Python
Linux    | x64, ARM64   | 3.8-3.12
macOS    | x64, ARM64   | 3.8-3.12
Windows  | x64          | 3.8-3.12

Building from Source

Automatic Build

python build_nim.py

Manual Build

# Build using nimble tasks
nimble build         # Build debug version
nimble buildRelease  # Build optimized release version

Development

Running Tests

pip install pytest
pytest tests/ -v

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests
  5. Submit a pull request

Documentation

Performance

The Nim implementation provides significant performance improvements over pure Python parsers, especially for large mmCIF files commonly used in structural biology.

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Built with Nim for high performance
  • Python integration via nimporter and nimpy
  • mmCIF format specification from wwPDB
