nim-mmcif

Fast mmCIF (Macromolecular Crystallographic Information File) parser written in Nim with Python bindings

The goal of this repository is to experiment with vibe coding while building something useful for the bioinformatics community, and to see how much of a cross-platform library can be driven to completion by transformers.

Features

  • 🚀 High-performance parsing of mmCIF files using Nim
  • 🌍 Cross-platform support (Linux, macOS, Windows)
  • 📦 Easy installation via pip

Installation

Prerequisites

From PyPI (when available)

pip install nim-mmcif

From Source

# Install Nim (platform-specific, see below)
# macOS: brew install nim
# Linux: curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Windows: scoop install nim

# Install the package
git clone https://github.com/lucidrains/nim-mmcif
cd nim-mmcif
pip install -e .

For detailed platform-specific instructions, see CROSS_PLATFORM.md.

Quick Start

Python Usage

Dictionary Access (Default)

import nim_mmcif

# Parse an mmCIF file (returns dict by default)
data = nim_mmcif.parse_mmcif("path/to/file.mmcif")
print(f"Found {len(data['atoms'])} atoms")

# Access atom properties using dictionary notation
first_atom = data['atoms'][0]
print(f"Atom {first_atom['id']}: {first_atom['label_atom_id']}")
print(f"Position: ({first_atom['x']}, {first_atom['y']}, {first_atom['z']})")

# Parse multiple files using glob patterns
results = nim_mmcif.parse_mmcif("path/to/*.mmcif")
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")
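Once the atoms are available as plain dictionaries, ordinary Python is enough for simple geometry. The sketch below computes an unweighted centroid. The `data` literal is a hypothetical stand-in shaped like the documented return value, and `centroid` is an illustrative helper, not part of nim-mmcif.

```python
from statistics import fmean

# Hypothetical sample shaped like the documented parse_mmcif() result.
# In real use: data = nim_mmcif.parse_mmcif("path/to/file.mmcif")
data = {
    "atoms": [
        {"id": 1, "label_atom_id": "N", "x": 0.0, "y": 0.0, "z": 0.0},
        {"id": 2, "label_atom_id": "CA", "x": 2.0, "y": 0.0, "z": 0.0},
        {"id": 3, "label_atom_id": "C", "x": 1.0, "y": 3.0, "z": 6.0},
    ]
}


def centroid(atoms):
    """Return the unweighted geometric centre of a list of atom dicts."""
    if not atoms:
        raise ValueError("no atoms to average")
    return (
        fmean(a["x"] for a in atoms),
        fmean(a["y"] for a in atoms),
        fmean(a["z"] for a in atoms),
    )


center = centroid(data["atoms"])
print(f"Centroid: {center}")
```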

Dataclass Access with Dot Notation (New!)

import nim_mmcif

# Parse with dataclass support for cleaner dot notation access
data = nim_mmcif.parse_mmcif("path/to/file.mmcif", as_dataclass=True)
print(f"Found {data.atom_count} atoms")

# Access atom properties using dot notation
first_atom = data.atoms[0]
print(f"Atom {first_atom.id}: {first_atom.label_atom_id}")
print(f"Position: ({first_atom.x}, {first_atom.y}, {first_atom.z})")
print(f"Chain: {first_atom.label_asym_id}, Residue: {first_atom.label_comp_id}")

# Use convenience properties and methods
print(f"Unique chains: {data.chains}")
print(f"Number of residues: {len(data.residues)}")

# Get all atoms from a specific chain
chain_a_atoms = data.get_chain('A')

# Get all atoms from a specific residue
residue_atoms = data.get_residue('A', 1)

# Get all positions as tuples
positions = data.positions  # List of (x, y, z) tuples

# Batch processing with dataclasses
results = nim_mmcif.parse_mmcif_batch(["file1.mmcif", "file2.mmcif"], as_dataclass=True)
for result in results:
    print(f"Structure has {result.atom_count} atoms in {len(result.chains)} chain(s)")

Other Functions

# Get atom count directly
count = nim_mmcif.get_atom_count("path/to/file.mmcif")
print(f"File contains {count} atoms")

# Get all atoms with their properties (returns list of dicts)
atoms = nim_mmcif.get_atoms("path/to/file.mmcif")
for atom in atoms[:5]:  # Print first 5 atoms
    print(f"Atom {atom['id']}: {atom['label_atom_id']} at ({atom['x']}, {atom['y']}, {atom['z']})")

# Get just the 3D coordinates
positions = nim_mmcif.get_atom_positions("path/to/file.mmcif")
for i, (x, y, z) in enumerate(positions[:5]):
    print(f"Position {i}: ({x:.3f}, {y:.3f}, {z:.3f})")

Nim Usage

import nim_mmcif/mmcif

# Parse an mmCIF file
let data = mmcif_parse("path/to/file.mmcif")
echo "Found ", data.atoms.len, " atoms"

# Iterate through atoms
for atom in data.atoms[0..<min(5, data.atoms.len)]:
  echo "Atom ", atom.id, ": ", atom.label_atom_id, 
       " at (", atom.Cartn_x, ", ", atom.Cartn_y, ", ", atom.Cartn_z, ")"

# Access specific atom properties
if data.atoms.len > 0:
  let firstAtom = data.atoms[0]
  echo "Chain: ", firstAtom.label_asym_id
  echo "Residue: ", firstAtom.label_comp_id
  echo "B-factor: ", firstAtom.B_iso_or_equiv

Batch Processing

Process multiple mmCIF files efficiently in a single operation:

import nim_mmcif

# List of mmCIF files to process
files = [
    "path/to/structure1.mmcif",
    "path/to/structure2.mmcif",
    "path/to/structure3.mmcif"
]

# Parse all files in batch (returns list when no globs used)
results = nim_mmcif.parse_mmcif_batch(files)

# Process results
for i, data in enumerate(results):
    print(f"Structure {i+1}: {len(data['atoms'])} atoms")
    
    # Analyze each structure
    atoms = data['atoms']
    if atoms:
        # Get unique chain IDs
        chains = set(atom['label_asym_id'] for atom in atoms)
        print(f"  Chains: {', '.join(sorted(chains))}")
        
        # Count residues
        residues = set((atom['label_asym_id'], atom['label_seq_id']) 
                      for atom in atoms)
        print(f"  Residues: {len(residues)}")

# Batch processing with glob patterns (returns dict)
results = nim_mmcif.parse_mmcif_batch("path/to/*.mmcif")
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

# Mix of glob patterns and regular paths (returns dict)
results = nim_mmcif.parse_mmcif_batch([
    "specific_file.mmcif",
    "structures/*.mmcif",
    "models/model_?.mmcif"
])
for filepath, data in results.items():
    print(f"{filepath}: {len(data['atoms'])} atoms")

Batch processing is particularly useful when:

  • Analyzing multiple protein structures for comparative studies
  • Processing entire datasets of crystallographic structures
  • Building machine learning datasets from PDB files
  • Performing high-throughput structural analysis

The batch function provides better performance than individual parsing when processing multiple files, as it reduces the overhead of repeated function calls.

API Reference

Functions

parse_mmcif(filepath: str, as_dataclass: bool = False) -> dict | MmcifData | dict[str, dict] | dict[str, MmcifData]

Parse an mmCIF file or files matching a glob pattern.

  • filepath: Path to mmCIF file or glob pattern
  • as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
  • Returns:
    • Single file + dict: Dictionary with 'atoms' key
    • Single file + dataclass: MmcifData instance
    • Glob pattern + dict: Dictionary mapping file paths to parsed data
    • Glob pattern + dataclass: Dictionary mapping file paths to MmcifData instances
  • Supports wildcards: * (any characters), ? (single character), ** (recursive)
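To see what each wildcard selects before handing a pattern to the parser, the same patterns can be tried with Python's standard `glob` module. This is a sketch under the assumption that nim-mmcif follows the usual shell-style semantics, which the text above does not confirm. The `expand` helper and the file names are made up for illustration.

```python
import glob
import tempfile
from pathlib import Path


def expand(pattern, root):
    """Expand a pattern under root and return sorted paths relative to root."""
    matches = glob.glob(str(Path(root) / pattern), recursive=True)
    return sorted(Path(m).relative_to(root).as_posix() for m in matches)


with tempfile.TemporaryDirectory() as root:
    # Create a few empty placeholder files to match against.
    for name in ["model_1.mmcif", "model_2.mmcif", "model_10.mmcif",
                 "nested/deep/other.mmcif"]:
        path = Path(root) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    star = expand("*.mmcif", root)          # any characters, one directory
    question = expand("model_?.mmcif", root)  # exactly one character
    recursive = expand("**/*.mmcif", root)  # this directory and all below

print(star)
print(question)
print(recursive)
```

Note that `?` matches `model_1.mmcif` and `model_2.mmcif` but not `model_10.mmcif`, because it stands for exactly one character.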

parse_mmcif_batch(filepaths: list[str] | str, as_dataclass: bool = False) -> list[dict] | list[MmcifData] | dict[str, dict] | dict[str, MmcifData]

Parse multiple mmCIF files in a single operation.

  • filepaths: List of paths, single path, or glob pattern
  • as_dataclass: If True, returns MmcifData dataclass(es) with dot notation access
  • Returns:
    • No glob + dict: List of dictionaries with parsed data
    • No glob + dataclass: List of MmcifData instances
    • With glob + dict: Dictionary mapping file paths to parsed data
    • With glob + dataclass: Dictionary mapping file paths to MmcifData instances
  • More efficient than parsing files individually when processing multiple structures
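Because the return type switches between a list and a dictionary depending on whether a glob pattern was used, calling code may want a single way to walk the results. The following is a minimal sketch. `iter_structures` is a hypothetical helper, not a library function, and the two sample values only mimic the documented shapes.

```python
def iter_structures(results):
    """Yield (label, data) pairs from either documented return shape.

    A list (no glob patterns) is labelled by position; a dict (glob
    patterns used) is labelled by file path.
    """
    if isinstance(results, dict):
        yield from results.items()
    else:
        for index, data in enumerate(results):
            yield index, data


# Hypothetical stand-ins for the two shapes parse_mmcif_batch() can return.
list_result = [{"atoms": [{"id": 1}]}, {"atoms": []}]
dict_result = {"a.mmcif": {"atoms": [{"id": 1}, {"id": 2}]}}

for results in (list_result, dict_result):
    for label, data in iter_structures(results):
        print(f"{label}: {len(data['atoms'])} atoms")
```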

get_atom_count(filepath: str) -> int

Get the number of atoms in an mmCIF file.

get_atoms(filepath: str) -> list[dict]

Get all atoms from an mmCIF file as a list of dictionaries.

get_atom_positions(filepath: str) -> list[tuple[float, float, float]]

Get 3D coordinates of all atoms as a list of (x, y, z) tuples.
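A list of coordinate tuples is convenient for quick geometric checks. The sketch below derives an axis-aligned bounding box and a pairwise distance from such a list. The `positions` values are invented sample data, and `bounding_box` is an illustrative helper rather than part of the package.

```python
import math

# Hypothetical sample shaped like the get_atom_positions() return value.
positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 1.0, 2.0)]


def bounding_box(points):
    """Return (min_corner, max_corner) of an axis-aligned bounding box."""
    if not points:
        raise ValueError("no points given")
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


low, high = bounding_box(positions)
print(f"Bounding box: {low} to {high}")

# Euclidean distance between the first two atoms (math.dist needs Python 3.8+).
d = math.dist(positions[0], positions[1])
print(f"Distance between atoms 0 and 1: {d:.3f}")
```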

Dataclasses

MmcifData

Container for parsed mmCIF data with typed atom access.

Properties:

  • atoms: List of Atom objects
  • atom_count: Total number of atoms
  • positions: List of (x, y, z) tuples for all atoms
  • chains: Set of unique chain identifiers
  • residues: Set of unique (chain_id, seq_id) tuples

Methods:

  • get_chain(chain_id: str): Get all atoms from a specific chain
  • get_residue(chain_id: str, seq_id: int): Get all atoms from a specific residue
  • to_dict(): Convert back to dictionary format
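To make the relationship between these properties and methods concrete, here is a small stand-alone sketch of a container that behaves the way the description above reads. `SketchAtom` and `SketchStructure` are hypothetical names used only for illustration. The actual `MmcifData` implementation may differ.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class SketchAtom:
    """Hypothetical stand-in for Atom; only a few fields are shown."""
    id: int
    label_atom_id: str
    label_asym_id: str
    label_seq_id: int
    x: float
    y: float
    z: float


@dataclass
class SketchStructure:
    """Hypothetical stand-in for MmcifData, built on a plain list of atoms."""
    atoms: List[SketchAtom] = field(default_factory=list)

    @property
    def atom_count(self) -> int:
        return len(self.atoms)

    @property
    def chains(self) -> Set[str]:
        return {a.label_asym_id for a in self.atoms}

    @property
    def residues(self) -> Set[Tuple[str, int]]:
        return {(a.label_asym_id, a.label_seq_id) for a in self.atoms}

    def get_chain(self, chain_id: str) -> List[SketchAtom]:
        return [a for a in self.atoms if a.label_asym_id == chain_id]

    def get_residue(self, chain_id: str, seq_id: int) -> List[SketchAtom]:
        return [a for a in self.get_chain(chain_id) if a.label_seq_id == seq_id]


structure = SketchStructure(atoms=[
    SketchAtom(1, "N", "A", 1, 0.0, 0.0, 0.0),
    SketchAtom(2, "CA", "A", 1, 1.5, 0.0, 0.0),
    SketchAtom(3, "N", "A", 2, 3.0, 1.0, 0.0),
    SketchAtom(4, "N", "B", 1, 9.0, 9.0, 9.0),
])
print(structure.atom_count, sorted(structure.chains), len(structure.residues))
```

Deriving `chains` and `residues` on demand keeps the atom list as the single source of truth, at the cost of recomputing the sets on each access.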

Atom

Represents a single atom with typed properties accessible via dot notation.

Properties:

  • type: Record type (ATOM or HETATM)
  • id: Atom serial number
  • type_symbol: Element symbol
  • label_atom_id: Atom name
  • label_comp_id: Residue name
  • label_asym_id: Chain identifier
  • label_entity_id: Entity ID
  • label_seq_id: Residue sequence number
  • Cartn_x, Cartn_y, Cartn_z: 3D coordinates
  • x, y, z: Convenient aliases for coordinates
  • occupancy: Occupancy factor
  • B_iso_or_equiv: B-factor (temperature factor)
  • position: Tuple of (x, y, z) coordinates

Methods:

  • to_dict(): Convert back to dictionary format

Dictionary Format

When using the default dictionary format (as_dataclass=False), each atom dictionary contains:

  • type: Record type (ATOM or HETATM)
  • id: Atom serial number
  • label_atom_id: Atom name
  • label_comp_id: Residue name
  • label_asym_id: Chain identifier
  • label_seq_id: Residue sequence number
  • x, y, z: 3D coordinates (aliases for Cartn_x, Cartn_y, Cartn_z)
  • occupancy: Occupancy factor
  • B_iso_or_equiv: B-factor
  • And more...
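The keys above are enough to regroup a flat atom list by residue. The sketch below keeps only ATOM records and collects atom names per residue. The sample atoms are invented, and the `None` sequence number on the HETATM record is a placeholder, since how the library represents a missing `label_seq_id` is not stated here.

```python
from collections import defaultdict

# Hypothetical atoms using the documented dictionary keys.
atoms = [
    {"type": "ATOM", "id": 1, "label_atom_id": "N", "label_comp_id": "MET",
     "label_asym_id": "A", "label_seq_id": 1},
    {"type": "ATOM", "id": 2, "label_atom_id": "CA", "label_comp_id": "MET",
     "label_asym_id": "A", "label_seq_id": 1},
    {"type": "ATOM", "id": 3, "label_atom_id": "N", "label_comp_id": "GLY",
     "label_asym_id": "A", "label_seq_id": 2},
    # Placeholder sequence number; the real representation is an assumption.
    {"type": "HETATM", "id": 4, "label_atom_id": "O", "label_comp_id": "HOH",
     "label_asym_id": "B", "label_seq_id": None},
]

# Drop HETATM records (waters, ligands) before grouping.
polymer_atoms = [a for a in atoms if a["type"] == "ATOM"]

# Map (chain, sequence number, residue name) to the atom names it contains.
residues = defaultdict(list)
for atom in polymer_atoms:
    key = (atom["label_asym_id"], atom["label_seq_id"], atom["label_comp_id"])
    residues[key].append(atom["label_atom_id"])

for (chain, seq_id, name), atom_names in sorted(residues.items()):
    print(f"{chain}:{name}{seq_id} -> {', '.join(atom_names)}")
```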

Platform Support

Platform  Architecture  Python    Status
Linux     x64, ARM64    3.8-3.12
macOS     x64, ARM64    3.8-3.12
Windows   x64           3.8-3.12

Building from Source

Automatic Build

python build_nim.py

Manual Build

# Build using nimble tasks
nimble build         # Build debug version
nimble buildRelease  # Build optimized release version

Development

Running Tests

pip install pytest
pytest tests/ -v

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests
  5. Submit a pull request

Documentation

Performance

The Nim implementation provides significant performance improvements over pure Python parsers, especially for large mmCIF files commonly used in structural biology.

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Built with Nim for high performance
  • Python integration via nimporter and nimpy
  • mmCIF format specification from wwPDB
