mmCIF parser written in Nim with Python bindings

nim-mmcif

Fast mmCIF (Macromolecular Crystallographic Information File) parser written in Nim with Python bindings

The goal of this repository is to experiment with vibe coding while building something useful for the bioinformatics community, and to see how much of a cross-platform library can be driven to completion by transformers.

Features

  • 🚀 High-performance parsing of mmCIF files using Nim
  • 🌍 Cross-platform support (Linux, macOS, Windows)
  • 📦 Easy installation via pip

Installation

Prerequisites

From PyPI

pip install nim-mmcif

From Source

# Install Nim (platform-specific, see below)
# macOS: brew install nim
# Linux: curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Windows: scoop install nim

# Install the package
git clone https://github.com/lucidrains/nim-mmcif
cd nim-mmcif
pip install -e .

For detailed platform-specific instructions, see CROSS_PLATFORM.md.

Quick Start

Python Usage

Dictionary Access

import nim_mmcif

# Parse an mmCIF file (returns dict by default)
data = nim_mmcif.parse_mmcif("path/to/file.mmcif")
print(f"Found {len(data['atoms'])} atoms")

# Access atom properties using dictionary notation
first_atom = data['atoms'][0]
print(f"Atom {first_atom['id']}: {first_atom['label_atom_id']}")
print(f"Position: ({first_atom['x']}, {first_atom['y']}, {first_atom['z']})")

# Parse multiple files using glob patterns
results = nim_mmcif.parse_mmcif("path/to/*.mmcif")
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")
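As a hedged illustration of working with the dictionary output, the sketch below computes the geometric center of a structure. It relies only on the `atoms` list and the `x`, `y`, `z` keys shown above. The `centroid` helper and the sample data are hypothetical and are not part of nim-mmcif; the stand-in dictionary takes the place of a real `parse_mmcif` result so the snippet runs without any input file.

```python
from typing import Dict, List, Tuple


def centroid(atoms: List[Dict]) -> Tuple[float, float, float]:
    """Return the unweighted mean position of a list of atom dicts."""
    if not atoms:
        raise ValueError("no atoms to average")
    n = len(atoms)
    return (
        sum(a["x"] for a in atoms) / n,
        sum(a["y"] for a in atoms) / n,
        sum(a["z"] for a in atoms) / n,
    )


# Stand-in for: data = nim_mmcif.parse_mmcif("path/to/file.mmcif")
# (made-up values, shaped like the dictionary output described above)
data = {
    "atoms": [
        {"id": 1, "label_atom_id": "N", "x": 0.0, "y": 0.0, "z": 0.0},
        {"id": 2, "label_atom_id": "CA", "x": 2.0, "y": 4.0, "z": 6.0},
    ]
}
print(centroid(data["atoms"]))
```

The same helper should work unchanged on each value of the dictionary returned for a glob pattern, since every value has the same shape.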

Dataclass Access

import nim_mmcif

# Parse with dataclass support for cleaner dot notation access
data = nim_mmcif.parse_mmcif("path/to/file.mmcif", as_dataclass=True)
print(f"Found {data.atom_count} atoms")

# Access atom properties using dot notation
first_atom = data.atoms[0]
print(f"Atom {first_atom.id}: {first_atom.label_atom_id}")
print(f"Position: ({first_atom.x}, {first_atom.y}, {first_atom.z})")
print(f"Chain: {first_atom.label_asym_id}, Residue: {first_atom.label_comp_id}")

# Use convenience properties and methods
print(f"Unique chains: {data.chains}")
print(f"Number of residues: {len(data.residues)}")

# Get all atoms from a specific chain
chain_a_atoms = data.get_chain('A')

# Get all atoms from a specific residue
residue_atoms = data.get_residue('A', 1)

# Get all positions as tuples
positions = data.positions  # List of (x, y, z) tuples

# Batch processing with dataclasses
results = nim_mmcif.parse_mmcif_batch(["file1.mmcif", "file2.mmcif"], as_dataclass=True)
for result in results:
    print(f"Structure has {result.atom_count} atoms in {len(result.chains)} chain(s)")

Other Functions

import nim_mmcif

# Get atom count directly
count = nim_mmcif.get_atom_count("path/to/file.mmcif")
print(f"File contains {count} atoms")

# Get all atoms with their properties (returns list of dicts)
atoms = nim_mmcif.get_atoms("path/to/file.mmcif")
for atom in atoms[:5]:  # Print first 5 atoms
    print(f"Atom {atom['id']}: {atom['label_atom_id']} at ({atom['x']}, {atom['y']}, {atom['z']})")

# Get just the 3D coordinates
positions = nim_mmcif.get_atom_positions("path/to/file.mmcif")
for i, (x, y, z) in enumerate(positions[:5]):
    print(f"Position {i}: ({x:.3f}, {y:.3f}, {z:.3f})")

Nim Usage

import nim_mmcif/mmcif

# Parse an mmCIF file
let data = mmcif_parse("path/to/file.mmcif")
echo "Found ", data.atoms.len, " atoms"

# Iterate through atoms
for atom in data.atoms[0..<min(5, data.atoms.len)]:
  echo "Atom ", atom.id, ": ", atom.label_atom_id, 
       " at (", atom.Cartn_x, ", ", atom.Cartn_y, ", ", atom.Cartn_z, ")"

# Access specific atom properties
if data.atoms.len > 0:
  let firstAtom = data.atoms[0]
  echo "Chain: ", firstAtom.label_asym_id
  echo "Residue: ", firstAtom.label_comp_id
  echo "B-factor: ", firstAtom.B_iso_or_equiv

Batch Processing

Process multiple mmCIF files efficiently in a single operation:

import nim_mmcif

# List of mmCIF files to process
files = [
    "path/to/structure1.mmcif",
    "path/to/structure2.mmcif",
    "path/to/structure3.mmcif"
]

# Parse all files in batch (returns list when no globs used)
results = nim_mmcif.parse_mmcif_batch(files)

# Process results
for i, data in enumerate(results):
    print(f"Structure {i+1}: {len(data['atoms'])} atoms")
    
    # Analyze each structure
    atoms = data['atoms']
    if atoms:
        # Get unique chain IDs
        chains = set(atom['label_asym_id'] for atom in atoms)
        print(f"  Chains: {', '.join(sorted(chains))}")
        
        # Count residues
        residues = set((atom['label_asym_id'], atom['label_seq_id']) 
                      for atom in atoms)
        print(f"  Residues: {len(residues)}")

# Batch processing with glob patterns (returns dict)
results = nim_mmcif.parse_mmcif_batch("path/to/*.mmcif")
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

# Mix of glob patterns and regular paths (returns dict)
results = nim_mmcif.parse_mmcif_batch([
    "specific_file.mmcif",
    "structures/*.mmcif",
    "models/model_?.mmcif"
])
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

Batch processing is particularly useful when:

  • Analyzing multiple protein structures for comparative studies
  • Processing entire datasets of crystallographic structures
  • Building machine learning datasets from PDB structures distributed as mmCIF files
  • Performing high-throughput structural analysis

The batch function provides better performance than individual parsing when processing multiple files, as it reduces the overhead of repeated function calls.
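Because `parse_mmcif_batch` returns a list for plain paths and a dictionary when glob patterns are involved, calling code may want to handle both shapes uniformly. The following is a minimal sketch of one way to do that. The `iter_batch` helper is hypothetical, and it assumes that list results come back in the same order as the input paths, which the README does not state explicitly.

```python
from typing import Dict, Iterator, List, Sequence, Tuple, Union

Parsed = Dict  # one parsed structure in the default dictionary format


def iter_batch(
    results: Union[List[Parsed], Dict[str, Parsed]],
    inputs: Sequence[str],
) -> Iterator[Tuple[str, Parsed]]:
    """Yield (path, parsed) pairs for either return shape."""
    if isinstance(results, dict):
        # Glob input: keys are the matched file paths.
        yield from results.items()
    else:
        # Plain paths: ASSUMED to be returned in input order.
        yield from zip(inputs, results)


# Stand-ins for the two return shapes (made-up data, no files needed).
list_shape = [{"atoms": [{}, {}]}, {"atoms": [{}]}]
dict_shape = {"structures/a.mmcif": {"atoms": [{}]}}

for path, parsed in iter_batch(list_shape, ["one.mmcif", "two.mmcif"]):
    print(f"{path}: {len(parsed['atoms'])} atoms")
for path, parsed in iter_batch(dict_shape, ["structures/*.mmcif"]):
    print(f"{path}: {len(parsed['atoms'])} atoms")
```

If the ordering assumption does not hold for a given release, passing a glob pattern and relying on the dictionary keys avoids the question entirely.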

API Reference

Functions

parse_mmcif(filepath: str, as_dataclass: bool = False) -> dict | MmcifData | dict[str, dict] | dict[str, MmcifData]

Parse an mmCIF file or files matching a glob pattern.

  • filepath: Path to mmCIF file or glob pattern
  • as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
  • Returns:
    • Single file + dict: Dictionary with 'atoms' key
    • Single file + dataclass: MmcifData instance
    • Glob pattern + dict: Dictionary mapping file paths to parsed data
    • Glob pattern + dataclass: Dictionary mapping file paths to MmcifData instances
  • Supports wildcards: * (any characters), ? (single character), ** (recursive)
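To make the three wildcards concrete, the sketch below uses Python's standard `glob` module on a temporary directory. This is only an illustration of the wildcard semantics listed above. It assumes that nim-mmcif's pattern matching behaves like Python's `glob`, which this README does not confirm, and the file names are made up.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create a small hypothetical layout of empty files.
    for rel in ("model_1.mmcif", "model_10.mmcif", "sub/deep.mmcif"):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    def match(pattern):
        """Return sorted matches relative to the temporary root."""
        hits = glob.glob(os.path.join(root, pattern), recursive=True)
        return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)

    star = match("*.mmcif")            # any characters, one directory level
    question = match("model_?.mmcif")  # exactly one character
    recursive = match("**/*.mmcif")    # this directory and all subdirectories

print(star)
print(question)
print(recursive)
```

Note that `model_?.mmcif` matches `model_1.mmcif` but not `model_10.mmcif`, and that only the `**` pattern reaches into `sub/`.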

parse_mmcif_batch(filepaths: list[str] | str, as_dataclass: bool = False) -> list[dict] | list[MmcifData] | dict[str, dict] | dict[str, MmcifData]

Parse multiple mmCIF files in a single operation.

  • filepaths: List of paths, single path, or glob pattern
  • as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
  • Returns:
    • No glob + dict: List of dictionaries with parsed data
    • No glob + dataclass: List of MmcifData instances
    • With glob + dict: Dictionary mapping file paths to parsed data
    • With glob + dataclass: Dictionary mapping file paths to MmcifData instances
  • More efficient than parsing files individually when processing multiple structures

get_atom_count(filepath: str) -> int

Get the number of atoms in an mmCIF file.

get_atoms(filepath: str) -> list[dict]

Get all atoms from an mmCIF file as a list of dictionaries.

get_atom_positions(filepath: str) -> list[tuple[float, float, float]]

Get 3D coordinates of all atoms as a list of (x, y, z) tuples.
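A list of `(x, y, z)` tuples is convenient for simple geometry. As a rough sketch, the snippet below computes an axis-aligned bounding box and a distance between two atoms using only the standard library. The `bounding_box` helper and the sample coordinates are hypothetical; in practice the list would come from `get_atom_positions`.

```python
import math
from typing import List, Tuple

Position = Tuple[float, float, float]


def bounding_box(positions: List[Position]) -> Tuple[Position, Position]:
    """Return the (min corner, max corner) of an axis-aligned box."""
    if not positions:
        raise ValueError("no positions given")
    xs, ys, zs = zip(*positions)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# Stand-in for: positions = nim_mmcif.get_atom_positions("path/to/file.mmcif")
positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 1.0, 2.0)]

lo, hi = bounding_box(positions)
diagonal = math.dist(lo, hi)                        # box diagonal length
first_two = math.dist(positions[0], positions[1])   # distance between two atoms
print(lo, hi, first_two)
```

`math.dist` is available from Python 3.8, which matches the lowest Python version listed under Platform Support.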

Dataclasses

MmcifData

Container for parsed mmCIF data with typed atom access.

Properties:

  • atoms: List of Atom objects
  • atom_count: Total number of atoms
  • positions: List of (x, y, z) tuples for all atoms
  • chains: Set of unique chain identifiers
  • residues: Set of unique (chain_id, seq_id) tuples

Methods:

  • get_chain(chain_id: str): Get all atoms from a specific chain
  • get_residue(chain_id: str, seq_id: int): Get all atoms from a specific residue
  • to_dict(): Convert back to dictionary format

Atom

Represents a single atom with typed properties accessible via dot notation.

Properties:

  • type: Record type (ATOM or HETATM)
  • id: Atom serial number
  • type_symbol: Element symbol
  • label_atom_id: Atom name
  • label_comp_id: Residue name
  • label_asym_id: Chain identifier
  • label_entity_id: Entity ID
  • label_seq_id: Residue sequence number
  • Cartn_x, Cartn_y, Cartn_z: 3D coordinates
  • x, y, z: Convenient aliases for coordinates
  • occupancy: Occupancy factor
  • B_iso_or_equiv: B-factor (temperature factor)
  • position: Tuple of (x, y, z) coordinates

Methods:

  • to_dict(): Convert back to dictionary format
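For readers who want a mental model of these containers, here is a minimal sketch of how properties and methods like the ones listed above could be written with Python's `dataclasses`. It is not the library's implementation. `AtomSketch` and `StructureSketch` are hypothetical names, they carry only a subset of the documented fields, and the field types are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class AtomSketch:
    """Hypothetical stand-in for Atom with a few of the documented fields."""
    id: int
    label_atom_id: str
    label_asym_id: str
    label_seq_id: int
    Cartn_x: float
    Cartn_y: float
    Cartn_z: float

    @property
    def position(self) -> Tuple[float, float, float]:
        return (self.Cartn_x, self.Cartn_y, self.Cartn_z)


@dataclass
class StructureSketch:
    """Hypothetical stand-in for MmcifData."""
    atoms: List[AtomSketch] = field(default_factory=list)

    @property
    def atom_count(self) -> int:
        return len(self.atoms)

    @property
    def chains(self) -> Set[str]:
        return {a.label_asym_id for a in self.atoms}

    @property
    def residues(self) -> Set[Tuple[str, int]]:
        return {(a.label_asym_id, a.label_seq_id) for a in self.atoms}

    def get_chain(self, chain_id: str) -> List[AtomSketch]:
        return [a for a in self.atoms if a.label_asym_id == chain_id]

    def get_residue(self, chain_id: str, seq_id: int) -> List[AtomSketch]:
        return [a for a in self.atoms
                if a.label_asym_id == chain_id and a.label_seq_id == seq_id]


# Made-up atoms: two in chain A, residue 1, and one in chain B.
structure = StructureSketch(atoms=[
    AtomSketch(1, "N", "A", 1, 0.0, 0.0, 0.0),
    AtomSketch(2, "CA", "A", 1, 1.5, 0.0, 0.0),
    AtomSketch(3, "N", "B", 1, 9.0, 9.0, 9.0),
])
print(structure.atom_count, sorted(structure.chains))
```

Deriving `chains` and `residues` as properties keeps them consistent with `atoms` at the cost of recomputing on each access; whether nim-mmcif caches these values is not stated here.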

Dictionary Format

When using the default dictionary format (as_dataclass=False), each atom dictionary contains:

  • type: Record type (ATOM or HETATM)
  • id: Atom serial number
  • label_atom_id: Atom name
  • label_comp_id: Residue name
  • label_asym_id: Chain identifier
  • label_seq_id: Residue sequence number
  • x, y, z: 3D coordinates (aliases for Cartn_x, Cartn_y, Cartn_z)
  • occupancy: Occupancy factor
  • B_iso_or_equiv: B-factor
  • And more...
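The keys above are enough to regroup a flat atom list. The sketch below filters out HETATM records and collects atom names per residue. The `atom_names_by_residue` helper and the sample atoms are hypothetical, and the exact value types (for example, whether `label_seq_id` is always an integer) are an assumption that should be checked against real output.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ResidueKey = Tuple[str, int, str]  # (chain, sequence number, residue name)


def atom_names_by_residue(atoms: List[Dict]) -> Dict[ResidueKey, List[str]]:
    """Group atom names by (label_asym_id, label_seq_id, label_comp_id)."""
    groups = defaultdict(list)
    for atom in atoms:
        key = (atom["label_asym_id"], atom["label_seq_id"], atom["label_comp_id"])
        groups[key].append(atom["label_atom_id"])
    return dict(groups)


# Made-up atoms in the documented dictionary format (subset of keys).
atoms = [
    {"type": "ATOM", "label_atom_id": "N", "label_comp_id": "MET",
     "label_asym_id": "A", "label_seq_id": 1},
    {"type": "ATOM", "label_atom_id": "CA", "label_comp_id": "MET",
     "label_asym_id": "A", "label_seq_id": 1},
    {"type": "ATOM", "label_atom_id": "N", "label_comp_id": "ALA",
     "label_asym_id": "A", "label_seq_id": 2},
    {"type": "HETATM", "label_atom_id": "O", "label_comp_id": "HOH",
     "label_asym_id": "B", "label_seq_id": 101},
]

# Keep only polymer records before grouping.
polymer = [a for a in atoms if a["type"] == "ATOM"]
grouped = atom_names_by_residue(polymer)
print(grouped)
```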

Platform Support

Platform    Architecture    Python      Status
Linux       x64, ARM64      3.8-3.12
macOS       x64, ARM64      3.8-3.12
Windows     x64             3.8-3.12

Building from Source

Automatic Build

python build_nim.py

Manual Build

# Build using nimble tasks
nimble build         # Build debug version
nimble buildRelease  # Build optimized release version

Development

Running Tests

pip install pytest
pytest tests/ -v

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests
  5. Submit a pull request

Documentation

Performance

The Nim implementation provides significant performance improvements over pure Python parsers, especially for large mmCIF files commonly used in structural biology.

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Built with Nim for high performance
  • Python integration via nimporter and nimpy
  • mmCIF format specification from wwPDB
