mmCIF parser written in Nim with Python bindings

nim-mmcif

Fast mmCIF (Macromolecular Crystallographic Information File) parser written in Nim with Python bindings

The goal of this repository is to experiment with vibe coding while building something useful for the bioinformatics community, and to see how much of a cross-platform library can be driven to completion by transformers.

Features

  • 🚀 High-performance parsing of mmCIF files using Nim
  • 🌍 Cross-platform support (Linux, macOS, Windows)
  • 📦 Easy installation via pip

Installation

Prerequisites

From PyPI

pip install nim-mmcif

From Source

# Install Nim (platform-specific, see below)
# macOS: brew install nim
# Linux: curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Windows: scoop install nim

# Install the package
git clone https://github.com/lucidrains/nim-mmcif
cd nim-mmcif
pip install -e .

For detailed platform-specific instructions, see CROSS_PLATFORM.md.

Quick Start

Python Usage

Dictionary Access

import nim_mmcif

# Parse an mmCIF file (returns dict by default)
data = nim_mmcif.parse_mmcif("path/to/file.mmcif")
print(f"Found {len(data['atoms'])} atoms")

# Access atom properties using dictionary notation
first_atom = data['atoms'][0]
print(f"Atom {first_atom['id']}: {first_atom['label_atom_id']}")
print(f"Position: ({first_atom['x']}, {first_atom['y']}, {first_atom['z']})")

# Parse multiple files using glob patterns
results = nim_mmcif.parse_mmcif("path/to/*.mmcif")
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

Dataclass Access

import nim_mmcif

# Parse with dataclass support for cleaner dot notation access
data = nim_mmcif.parse_mmcif("path/to/file.mmcif", as_dataclass=True)
print(f"Found {data.atom_count} atoms")

# Access atom properties using dot notation
first_atom = data.atoms[0]
print(f"Atom {first_atom.id}: {first_atom.label_atom_id}")
print(f"Position: ({first_atom.x}, {first_atom.y}, {first_atom.z})")
print(f"Chain: {first_atom.label_asym_id}, Residue: {first_atom.label_comp_id}")

# Use convenience properties and methods
print(f"Unique chains: {data.chains}")
print(f"Number of residues: {len(data.residues)}")

# Get all atoms from a specific chain
chain_a_atoms = data.get_chain('A')

# Get all atoms from a specific residue
residue_atoms = data.get_residue('A', 1)

# Get all positions as tuples
positions = data.positions  # List of (x, y, z) tuples

# Batch processing with dataclasses
results = nim_mmcif.parse_mmcif_batch(["file1.mmcif", "file2.mmcif"], as_dataclass=True)
for result in results:
    print(f"Structure has {result.atom_count} atoms in {len(result.chains)} chain(s)")

Other Functions

# Get atom count directly
count = nim_mmcif.get_atom_count("path/to/file.mmcif")
print(f"File contains {count} atoms")

# Get all atoms with their properties (returns list of dicts)
atoms = nim_mmcif.get_atoms("path/to/file.mmcif")
for atom in atoms[:5]:  # Print first 5 atoms
    print(f"Atom {atom['id']}: {atom['label_atom_id']} at ({atom['x']}, {atom['y']}, {atom['z']})")

# Get just the 3D coordinates
positions = nim_mmcif.get_atom_positions("path/to/file.mmcif")
for i, (x, y, z) in enumerate(positions[:5]):
    print(f"Position {i}: ({x:.3f}, {y:.3f}, {z:.3f})")

Nim Usage

First

$ nimble install nim_mmcif

Then

import nim_mmcif

# Parse an mmCIF file
let data = mmcif_parse("path/to/file.mmcif")
echo "Found ", data.atoms.len, " atoms"

# Iterate through atoms
for atom in data.atoms[0..<min(5, data.atoms.len)]:
  echo "Atom ", atom.id, ": ", atom.label_atom_id, 
       " at (", atom.Cartn_x, ", ", atom.Cartn_y, ", ", atom.Cartn_z, ")"

# Access specific atom properties
if data.atoms.len > 0:
  let firstAtom = data.atoms[0]
  echo "Chain: ", firstAtom.label_asym_id
  echo "Residue: ", firstAtom.label_comp_id
  echo "B-factor: ", firstAtom.B_iso_or_equiv

Batch Processing

Process multiple mmCIF files efficiently in a single operation:

import nim_mmcif

# List of mmCIF files to process
files = [
    "path/to/structure1.mmcif",
    "path/to/structure2.mmcif",
    "path/to/structure3.mmcif"
]

# Parse all files in batch (returns list when no globs used)
results = nim_mmcif.parse_mmcif_batch(files)

# Process results
for i, data in enumerate(results):
    print(f"Structure {i+1}: {len(data['atoms'])} atoms")
    
    # Analyze each structure
    atoms = data['atoms']
    if atoms:
        # Get unique chain IDs
        chains = set(atom['label_asym_id'] for atom in atoms)
        print(f"  Chains: {', '.join(sorted(chains))}")
        
        # Count residues
        residues = set((atom['label_asym_id'], atom['label_seq_id']) 
                      for atom in atoms)
        print(f"  Residues: {len(residues)}")

# Batch processing with glob patterns (returns dict)
results = nim_mmcif.parse_mmcif_batch("path/to/*.mmcif")
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

# Mix of glob patterns and regular paths (returns dict)
results = nim_mmcif.parse_mmcif_batch([
    "specific_file.mmcif",
    "structures/*.mmcif",
    "models/model_?.mmcif"
])
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

Batch processing is particularly useful when:

  • Analyzing multiple protein structures for comparative studies
  • Processing entire datasets of crystallographic structures
  • Building machine learning datasets from Protein Data Bank (PDB) structures
  • Performing high-throughput structural analysis

The batch function provides better performance than individual parsing when processing multiple files, as it reduces the overhead of repeated function calls.

API Reference

Functions

parse_mmcif(filepath: str, as_dataclass: bool = False) -> dict | MmcifData | dict[str, dict] | dict[str, MmcifData]

Parse an mmCIF file or files matching a glob pattern.

  • filepath: Path to mmCIF file or glob pattern
  • as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
  • Returns:
    • Single file + dict: Dictionary with 'atoms' key
    • Single file + dataclass: MmcifData instance
    • Glob pattern + dict: Dictionary mapping file paths to parsed data
    • Glob pattern + dataclass: Dictionary mapping file paths to MmcifData instances
  • Supports wildcards: * (any characters), ? (single character), ** (recursive)
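The wildcard rules above resemble those of Python's standard `glob` module. The sketch below assumes, without confirmation from the source, that patterns behave like `glob.glob(..., recursive=True)`. It builds a throwaway directory of empty files (all file names are hypothetical) and shows which ones each wildcard would select.

```python
import glob
import os
import tempfile
from typing import List

def expand(pattern: str, root: str) -> List[str]:
    """Return paths under root that match pattern, relative to root and sorted."""
    matches = glob.glob(os.path.join(root, pattern), recursive=True)
    return sorted(os.path.relpath(m, root).replace(os.sep, "/") for m in matches)

# Build a small throwaway tree of empty files with hypothetical names.
root = tempfile.mkdtemp()
for rel in ["model_1.mmcif", "model_2.mmcif", "model_10.mmcif", "sub/deep.mmcif"]:
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w"):
        pass

star = expand("*.mmcif", root)            # top-level files only
question = expand("model_?.mmcif", root)  # exactly one character after "model_"
recursive = expand("**/*.mmcif", root)    # top level plus all subdirectories

print(star, question, recursive, sep="\n")
```

Under this assumption, `model_?.mmcif` skips `model_10.mmcif` because `?` stands for exactly one character, and only the `**` pattern reaches files in subdirectories.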

parse_mmcif_batch(filepaths: list[str] | str, as_dataclass: bool = False) -> list[dict] | list[MmcifData] | dict[str, dict] | dict[str, MmcifData]

Parse multiple mmCIF files in a single operation.

  • filepaths: List of paths, single path, or glob pattern
  • as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
  • Returns:
    • No glob + dict: List of dictionaries with parsed data
    • No glob + dataclass: List of MmcifData instances
    • With glob + dict: Dictionary mapping file paths to parsed data
    • With glob + dataclass: Dictionary mapping file paths to MmcifData instances
  • More efficient than parsing files individually when processing multiple structures
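Because the return type depends on whether a glob pattern was used, calling code may want to treat both shapes uniformly. The helper below is a minimal sketch, not part of the library: `iter_results` is a hypothetical name, and the sample data only imitates the documented dictionary format. It is meant for batch results only, since a single-file result from `parse_mmcif` is also a dict.

```python
from typing import Any, Dict, Iterator, List, Tuple, Union

Parsed = Dict[str, Any]

def iter_results(
    results: Union[List[Parsed], Dict[str, Parsed]]
) -> Iterator[Tuple[str, Parsed]]:
    """Yield (label, data) pairs from either documented batch return shape."""
    if isinstance(results, dict):
        # Glob case: keys are file paths.
        yield from results.items()
    else:
        # No-glob case: a list in input order, labelled here by index.
        for index, data in enumerate(results):
            yield f"#{index}", data

# Hand-written stand-ins for the two return shapes.
as_list = [{"atoms": [{"id": 1}]}, {"atoms": []}]
as_dict = {"a.mmcif": {"atoms": [{"id": 1}, {"id": 2}]}}

counts_list = [(label, len(d["atoms"])) for label, d in iter_results(as_list)]
counts_dict = [(label, len(d["atoms"])) for label, d in iter_results(as_dict)]
print(counts_list)
print(counts_dict)
```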

get_atom_count(filepath: str) -> int

Get the number of atoms in an mmCIF file.

get_atoms(filepath: str) -> list[dict]

Get all atoms from an mmCIF file as a list of dictionaries.

get_atom_positions(filepath: str) -> list[tuple[float, float, float]]

Get 3D coordinates of all atoms as a list of (x, y, z) tuples.
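A list of (x, y, z) tuples is convenient for simple geometry. As an illustration, the sketch below computes the centroid of such a list. The `centroid` helper and the sample coordinates are assumptions made for this example, not part of the library.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]

def centroid(positions: Sequence[Point]) -> Point:
    """Return the arithmetic mean of a non-empty sequence of 3D points."""
    if not positions:
        raise ValueError("no positions given")
    n = len(positions)
    sx = sum(p[0] for p in positions)
    sy = sum(p[1] for p in positions)
    sz = sum(p[2] for p in positions)
    return (sx / n, sy / n, sz / n)

# Stand-in for the output of get_atom_positions(...).
sample = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 6.0)]
center = centroid(sample)
print(center)
```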

Dataclasses

MmcifData

Container for parsed mmCIF data with typed atom access.

Properties:

  • atoms: List of Atom objects
  • atom_count: Total number of atoms
  • positions: List of (x, y, z) tuples for all atoms
  • chains: Set of unique chain identifiers
  • residues: Set of unique (chain_id, seq_id) tuples

Methods:

  • get_chain(chain_id: str): Get all atoms from a specific chain
  • get_residue(chain_id: str, seq_id: int): Get all atoms from a specific residue
  • to_dict(): Convert back to dictionary format
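For readers working with the dictionary format, the same views can be approximated in plain Python. This is a sketch of equivalent logic, not the library's implementation. It assumes that chains and residues are derived from `label_asym_id` and `label_seq_id`, as the property descriptions suggest, and the sample atoms are made up.

```python
from typing import Any, Dict, List, Set, Tuple

AtomDict = Dict[str, Any]

def chains(atoms: List[AtomDict]) -> Set[str]:
    """Unique chain identifiers."""
    return {a["label_asym_id"] for a in atoms}

def residues(atoms: List[AtomDict]) -> Set[Tuple[str, int]]:
    """Unique (chain_id, seq_id) pairs."""
    return {(a["label_asym_id"], a["label_seq_id"]) for a in atoms}

def get_chain(atoms: List[AtomDict], chain_id: str) -> List[AtomDict]:
    """All atoms belonging to one chain, in input order."""
    return [a for a in atoms if a["label_asym_id"] == chain_id]

def get_residue(atoms: List[AtomDict], chain_id: str, seq_id: int) -> List[AtomDict]:
    """All atoms belonging to one residue of one chain, in input order."""
    return [a for a in get_chain(atoms, chain_id) if a["label_seq_id"] == seq_id]

# Made-up atoms carrying only the keys this sketch needs.
atoms = [
    {"label_atom_id": "N", "label_asym_id": "A", "label_seq_id": 1},
    {"label_atom_id": "CA", "label_asym_id": "A", "label_seq_id": 1},
    {"label_atom_id": "N", "label_asym_id": "A", "label_seq_id": 2},
    {"label_atom_id": "N", "label_asym_id": "B", "label_seq_id": 1},
]

print(sorted(chains(atoms)))
print(sorted(residues(atoms)))
print([a["label_atom_id"] for a in get_residue(atoms, "A", 1)])
```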

Atom

Represents a single atom with typed properties accessible via dot notation.

Properties:

  • type: Record type (ATOM or HETATM)
  • id: Atom serial number
  • type_symbol: Element symbol
  • label_atom_id: Atom name
  • label_comp_id: Residue name
  • label_asym_id: Chain identifier
  • label_entity_id: Entity ID
  • label_seq_id: Residue sequence number
  • Cartn_x, Cartn_y, Cartn_z: 3D coordinates
  • x, y, z: Convenient aliases for coordinates
  • occupancy: Occupancy factor
  • B_iso_or_equiv: B-factor (temperature factor)
  • position: Tuple of (x, y, z) coordinates

Methods:

  • to_dict(): Convert back to dictionary format
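To make the alias properties concrete, here is a minimal sketch of how a dataclass could expose `x`, `y`, `z`, and `position` on top of the `Cartn_*` fields. `AtomSketch` is a hypothetical class with a reduced field set, written only for illustration. It is not the library's `Atom`, and the source does not say whether the real `to_dict()` also emits the aliases.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Tuple

@dataclass
class AtomSketch:
    """Illustrative stand-in with a subset of the documented fields."""
    id: int
    label_atom_id: str
    Cartn_x: float
    Cartn_y: float
    Cartn_z: float

    @property
    def x(self) -> float:
        return self.Cartn_x

    @property
    def y(self) -> float:
        return self.Cartn_y

    @property
    def z(self) -> float:
        return self.Cartn_z

    @property
    def position(self) -> Tuple[float, float, float]:
        return (self.Cartn_x, self.Cartn_y, self.Cartn_z)

    def to_dict(self) -> Dict[str, Any]:
        # Stored fields only; the aliases are computed, not stored.
        return asdict(self)

atom = AtomSketch(id=1, label_atom_id="CA", Cartn_x=1.0, Cartn_y=2.0, Cartn_z=3.0)
print(atom.x, atom.position, atom.to_dict())
```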

Dictionary Format

When using the default dictionary format (as_dataclass=False), each atom dictionary contains:

  • type: Record type (ATOM or HETATM)
  • id: Atom serial number
  • label_atom_id: Atom name
  • label_comp_id: Residue name
  • label_asym_id: Chain identifier
  • label_seq_id: Residue sequence number
  • x, y, z: 3D coordinates (aliases for Cartn_x, Cartn_y, Cartn_z)
  • occupancy: Occupancy factor
  • B_iso_or_equiv: B-factor
  • And more...
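The keys above are enough for simple filtering and statistics. The sketch below separates ATOM from HETATM records and averages the B-factor. The helper names and sample values are made up, and the code assumes `B_iso_or_equiv` holds a number or a numeric string, which the source does not state.

```python
from statistics import mean
from typing import Any, Dict, List, Tuple

AtomDict = Dict[str, Any]

def split_by_record_type(atoms: List[AtomDict]) -> Tuple[List[AtomDict], List[AtomDict]]:
    """Return (ATOM records, HETATM records), each in input order."""
    standard = [a for a in atoms if a["type"] == "ATOM"]
    hetero = [a for a in atoms if a["type"] == "HETATM"]
    return standard, hetero

def mean_b_factor(atoms: List[AtomDict]) -> float:
    """Average B-factor over a non-empty list of atoms."""
    return mean(float(a["B_iso_or_equiv"]) for a in atoms)

# Made-up atoms carrying only the keys this sketch needs.
atoms = [
    {"type": "ATOM", "label_atom_id": "N", "B_iso_or_equiv": 10.0},
    {"type": "ATOM", "label_atom_id": "CA", "B_iso_or_equiv": 20.0},
    {"type": "HETATM", "label_atom_id": "O", "B_iso_or_equiv": 30.0},
]

standard, hetero = split_by_record_type(atoms)
print(len(standard), len(hetero), mean_b_factor(standard))
```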

Platform Support

Platform  Architecture  Python    Status
Linux     x64, ARM64    3.8-3.12
macOS     x64, ARM64    3.8-3.12
Windows   x64           3.8-3.12

Building from Source

Automatic Build

python build_nim.py

Manual Build

# Build using nimble tasks
nimble build         # Build debug version
nimble buildRelease  # Build optimized release version

Development

Running Tests

pip install pytest
pytest tests/ -v

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests
  5. Submit a pull request

Documentation

Performance

The Nim implementation provides significant performance improvements over pure Python parsers, especially for large mmCIF files commonly used in structural biology.
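No benchmark figures are given here, so the size of the speedup is best measured on your own files. The helper below is a minimal timing sketch. `best_time` is a hypothetical name, and the demonstration uses a stand-in workload so that it runs without any mmCIF file or parser installed.

```python
import time
from typing import Any, Callable

def best_time(func: Callable[..., Any], *args: Any, repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over several runs."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload so the sketch runs anywhere.
elapsed = best_time(sum, range(100_000))
print(f"best of 5 runs: {elapsed:.6f} s")

# With real data one might compare, for example:
#   best_time(nim_mmcif.parse_mmcif, "path/to/file.mmcif")
# against the same call on another parser.
```

Taking the minimum over several runs reduces noise from other processes and from a cold file cache on the first read.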

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Built with Nim for high performance
  • Python integration via nimporter and nimpy
  • mmCIF format specification from wwPDB
