Import metadata from mmCIF files using CSV specifications.

mmCIF Metadata Importer

This tool imports metadata from mmCIF files into new metadata-only files or into existing models. It uses the gemmi library, with automatic method detection and method-specific CSV specification files.

Protein Data Bank in Europe (PDBe) · pdbe.org

User guide

User tutorial

The same content lives in the repository as docs/user-tutorial.html. The site root (docs/index.html) redirects there so https://pdbeurope.github.io/mmcif-metadata-import/ serves the tutorial.

Installation

From PyPI (recommended): installs the mmcif-metadata-import command-line tool.

pip install mmcif-metadata-import

From source (clone the repository): install dependencies, then run with python import_metadata.py using the same arguments as below.

pip install -r requirements.txt

Jupyter notebook (interactive UI)

A Jupyter notebook provides an interactive form (file upload, checkboxes, run button)—no command line or web hosting needed.

Run in browser (no install):
Click the Binder badge to open the notebook on mybinder.org. The first launch may take a few minutes while the environment builds (this includes installing gemmi). Download any output files from the notebook links before closing the tab, because Binder sessions are temporary.

Run locally:

  1. Install notebook dependencies:
pip install -r requirements-notebook.txt
  2. Start Jupyter (jupyter notebook or jupyter lab), open metadata_import.ipynb, and run all cells. Use the widgets to upload mmCIF files, select specifications, and run the import. Outputs are saved in notebook_output/.

Usage

mmcif-metadata-import <input_file> [--xray] [--xray_serial] [--em] [--nmr] [--macromolecules] [--citation] [--authors] [--funding] [--keywords] [-o output_file] [--merge_to_file target_file] [--log]

Arguments

  • input_file: Input mmCIF file (supports .cif and .cif.V[ordinal] extensions)
  • --xray: Optional flag to include X-ray specific categories from specs/XRAY.csv
  • --xray_serial: Optional flag to include X-ray serial specific categories from specs/XRAY_SERIAL.csv
  • --em: Optional flag to include electron microscopy specific categories from specs/EM.csv
  • --nmr: Optional flag to include NMR specific categories from specs/NMR.csv
  • --macromolecules: Optional flag to include macromolecules categories from specs/MACROMOLECULES.csv
  • --citation: Optional flag to include citation categories from specs/CITATION.csv
  • --authors: Optional flag to include author categories from specs/AUTHORS.csv
  • --funding: Optional flag to include funding categories from specs/FUNDING.csv
  • --keywords: Optional flag to include keyword categories from specs/KEYWORDS.csv
  • -o, --output: Optional output file name (default: [input_name]_metadata.cif)
  • --merge_to_file: Optional file path to merge imported metadata into (instead of creating a new file). Metadata will be added to the first data block of the target file. The output file will be named <originalname>_merged_with_<inputfilename> in the same directory as the target file.
  • --log: Optional flag to generate a log file with detailed information about the import process. The log file is automatically named based on the output file (same name with .log extension) and placed in the same directory as the output file.

Note: At least one specification flag (method-specific or optional) must be provided.

Merge Mode: When --merge_to_file is provided, the imported metadata will be merged into the first data block of the specified file. The metadata items will be added at the end of the first data block, before any subsequent data blocks. A new file will be created with the name pattern <originalname>_merged_with_<inputfilename> (e.g., if merging target.cif with metadata from input.cif, the output will be target_merged_with_input.cif). The original target file is not modified. Important: Categories and items that already exist in the target file will not be merged to avoid overwriting existing data. These will be reported in the log file as "Categories not imported" and "Items not imported". If --merge_to_file is not provided, a new metadata file will be created as specified by -o/--output.
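The no-overwrite rule can be illustrated with a short Python sketch. This is not the tool's implementation: the function name plan_merge and the set-based data model are hypothetical, and the sketch assumes one possible reading of the rule, in which a category requested as a whole is skipped when the target already holds any item of that category, and an individually requested item is skipped when that exact item already exists.

```python
def plan_merge(target_tags, requested_categories, requested_items):
    """Decide what may be merged without overwriting existing data.

    target_tags: full item names (e.g. "_citation.id") already present
                 in the first data block of the target file.
    requested_categories: category names requested as a whole.
    requested_items: full item names requested individually.
    All names and the exact skip rule are assumptions for illustration.
    """
    # Categories already present in the target, derived from its item names.
    target_categories = {tag.split(".", 1)[0] for tag in target_tags}

    return {
        "categories_imported": sorted(
            c for c in requested_categories if c not in target_categories
        ),
        "categories_not_imported": sorted(
            c for c in requested_categories if c in target_categories
        ),
        "items_imported": sorted(
            t for t in requested_items if t not in target_tags
        ),
        "items_not_imported": sorted(
            t for t in requested_items if t in target_tags
        ),
    }


plan = plan_merge(
    target_tags={"_citation.id", "_struct_keywords.text"},
    requested_categories={"_citation", "_pdbx_audit_support"},
    requested_items={"_struct_keywords.text", "_struct_keywords.pdbx_keywords"},
)
print(plan)
```

The two "not imported" lists correspond to the "Categories not imported" and "Items not imported" entries described for the log file.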

Method Validation: The script automatically detects the input file's method and validates method-specific flags. If you try to use --xray on an EM file, the script will warn you and skip the X-ray specification to prevent importing incompatible metadata.
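As a rough illustration of this check, the sketch below filters a list of flags against a detected method. The function validate_flags and the FLAG_METHODS mapping are hypothetical and not taken from the tool's source. --xray_serial is left out because this document does not state which detected method it is matched against.

```python
# Hypothetical mapping from a method flag to the detected methods it accepts.
FLAG_METHODS = {
    "xray": {"XRAY"},
    "em": {"EM_MAP_ONLY", "EM_MODEL_ONLY", "EM_MAP_MODEL"},
    "nmr": {"NMR"},
}
# Human-readable labels used in the warning text.
FLAG_LABELS = {"xray": "X-ray", "em": "EM", "nmr": "NMR"}


def validate_flags(detected_method, flags):
    """Return (kept_flags, warnings) for the given detected method."""
    kept, warnings = [], []
    for flag in flags:
        allowed = FLAG_METHODS.get(flag)
        if allowed is None or detected_method in allowed:
            # Flags without a method restriction (e.g. "citation") always pass.
            kept.append(flag)
        else:
            label = FLAG_LABELS[flag]
            warnings.append(
                f"Warning: Skipping {label} specification - input file method "
                f"({detected_method}) doesn't match {label} method"
            )
    return kept, warnings


kept, warnings = validate_flags("EM_MAP_ONLY", ["em", "xray", "macromolecules"])
print(kept)
print(warnings[0])
```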

Examples

# Basic usage with method-specific files
mmcif-metadata-import input.cif --xray
mmcif-metadata-import input.cif --xray_serial
mmcif-metadata-import input.cif --em
mmcif-metadata-import input.cif --nmr

# With custom output name
mmcif-metadata-import input.cif --xray -o custom_output.cif

# Using only optional specification files
mmcif-metadata-import input.cif --macromolecules
mmcif-metadata-import input.cif --citation --authors
mmcif-metadata-import input.cif --funding --keywords

# Combine method-specific with optional files
mmcif-metadata-import input.cif --em --macromolecules
mmcif-metadata-import input.cif --xray --citation --authors
mmcif-metadata-import input.cif --nmr --funding --keywords

# Multiple method-specific files (specifications that do not match the detected method are skipped)
mmcif-metadata-import input.cif --xray --xray_serial --em --nmr

# All optional categories
mmcif-metadata-import input.cif --macromolecules --citation --authors --funding --keywords

# Everything together
mmcif-metadata-import input.cif --xray --em --nmr --macromolecules --citation --authors --funding --keywords

# Method validation example (EM file with X-ray flag - X-ray will be skipped)
mmcif-metadata-import em_file.cif --em --xray --macromolecules
# Output: "Warning: Skipping X-ray specification - input file method (EM_MAP_ONLY) doesn't match X-ray method"

# Merge metadata into an existing file (single data block)
mmcif-metadata-import input.cif --xray --merge_to_file target.cif

# Merge metadata into an existing file with multiple data blocks
mmcif-metadata-import input.cif --xray --merge_to_file target_multiple_datablocks.cif
# Metadata will be added to the first data block, before the second data block

# Generate a log file with detailed import information (automatically named input_metadata.log, after the output file)
mmcif-metadata-import input.cif --xray --log

# Combine merge with log file (log file automatically named based on merge output)
mmcif-metadata-import input.cif --xray --merge_to_file target.cif --log
# Log file will be: target_merged_with_input.log (same directory as target)

Method Detection

The script automatically detects the source method (FROM) from the input mmCIF file based on:

  • XRAY: _exptl.method = "X-RAY DIFFRACTION"
  • NMR: _exptl.method = "SOLUTION NMR"
  • EM_MAP_ONLY: _exptl.method = "ELECTRON MICROSCOPY" + _database_2.database_id contains "WWPDB" and "EMDB"
  • EM_MODEL_ONLY: _exptl.method = "ELECTRON MICROSCOPY" + _database_2.database_id contains "WWPDB" and "PDB"
  • EM_MAP_MODEL: _exptl.method = "ELECTRON MICROSCOPY" + _database_2.database_id contains "WWPDB", "PDB", and "EMDB"
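The rules above can be sketched as a small pure-Python function. This is an illustrative reading, not the tool's code: the name detect_method is hypothetical, the values are assumed to have been read from the file already (for example with gemmi), comparison is assumed to be case-insensitive, and the most specific EM rule is checked first so that a file listing all three databases is not classified as map-only or model-only.

```python
def detect_method(exptl_method, database_ids):
    """Classify an entry from its experimental method and database IDs.

    exptl_method: the method string, e.g. "X-RAY DIFFRACTION".
    database_ids: iterable of database_id values, e.g. ["PDB", "WWPDB"].
    Returns a method label, or None when no rule matches (an assumption).
    """
    method = exptl_method.strip().upper()
    ids = {value.strip().upper() for value in database_ids}

    if method == "X-RAY DIFFRACTION":
        return "XRAY"
    if method == "SOLUTION NMR":
        return "NMR"
    if method == "ELECTRON MICROSCOPY" and "WWPDB" in ids:
        has_pdb = "PDB" in ids
        has_emdb = "EMDB" in ids
        # Check the most specific combination first.
        if has_pdb and has_emdb:
            return "EM_MAP_MODEL"
        if has_emdb:
            return "EM_MAP_ONLY"
        if has_pdb:
            return "EM_MODEL_ONLY"
    return None


print(detect_method("ELECTRON MICROSCOPY", ["WWPDB", "EMDB"]))
print(detect_method("X-RAY DIFFRACTION", ["PDB", "WWPDB"]))
```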

Specification Files

All specification CSV files are located in the specs/ subdirectory.

The script uses simplified method-specific CSV files:

Method-Specific Files:

  • specs/XRAY.csv - X-ray crystallography specific categories
  • specs/XRAY_SERIAL.csv - X-ray serial specific categories
  • specs/EM.csv - Electron microscopy specific categories
  • specs/NMR.csv - Nuclear magnetic resonance specific categories

Optional Specification Files

The script supports several optional flags that add categories from separate CSV files. These can be used on their own or combined with a method-specific specification file.

Available Optional Files:

--macromolecules (specs/MACROMOLECULES.csv)

Contains macromolecule-related categories:

  • _entity, _entity_name_com, _entity_poly, _entity_poly_seq
  • _entity_src_nat, _entity_src_gen, _pdbx_entity_src_syn
  • _struct_ref, _struct_ref_seq, _struct_ref_seq_dif

--citation (specs/CITATION.csv)

Contains citation-related categories:

  • _citation, _citation_author

--authors (specs/AUTHORS.csv)

Contains author-related categories:

  • _pdbx_contact_author, _em_author_list

--funding (specs/FUNDING.csv)

Contains funding-related categories:

  • _pdbx_audit_support

--keywords (specs/KEYWORDS.csv)

Contains keyword-related items:

  • _struct_keywords.text, _struct_keywords.pdbx_keywords, _struct_keywords.pdbx_details

All optional categories are merged with the method-specific specification file to provide comprehensive metadata information in the output.

CSV Specification File Format

Each CSV specification file should contain the following columns:

  • category: The mmCIF category name (e.g., _pdbx_contact_author)
  • item: The specific item name within the category (e.g., id, name_first). Leave empty for category-level specifications.
  • should_import: Whether to include this category/item (Y for yes, N for no)
  • type: Either category (for entire category) or item (for specific items)

Example CSV structure:

category,item,should_import,type
_pdbx_contact_author,,Y,category
_citation,,Y,category
_struct_keywords,text,Y,item
_struct_keywords,pdbx_keywords,Y,item
_database_2,,N,category
_struct_keywords,entry_id,N,item

Annotated Example:

# Header row
category,item,should_import,type

# Include entire _pdbx_contact_author category (all items)
_pdbx_contact_author,,Y,category

# Include entire _citation category (all items)
_citation,,Y,category

# Include only specific items from _struct_keywords category
_struct_keywords,text,Y,item                    # Include _struct_keywords.text
_struct_keywords,pdbx_keywords,Y,item           # Include _struct_keywords.pdbx_keywords
_struct_keywords,entry_id,N,item                # Exclude _struct_keywords.entry_id

# Exclude entire _database_2 category (no items)
_database_2,,N,category

Key Points:

  • Empty item column = entire category (use type=category)
  • Filled item column = specific item (use type=item)
  • Y = include this category/item
  • N = exclude this category/item
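To make the column semantics concrete, here is a minimal sketch that reads a specification in this format with Python's standard csv module. The function load_spec is hypothetical, and the sketch simply ignores rows marked N; how the tool itself treats an N item inside a category marked Y is not stated here.

```python
import csv
import io

# Same rows as the example CSV structure above.
SPEC_TEXT = """category,item,should_import,type
_pdbx_contact_author,,Y,category
_citation,,Y,category
_struct_keywords,text,Y,item
_struct_keywords,pdbx_keywords,Y,item
_database_2,,N,category
_struct_keywords,entry_id,N,item
"""


def load_spec(text):
    """Return (categories, items) requested for import by a spec CSV.

    categories: names of categories requested as a whole.
    items: full item names ("_category.item") requested individually.
    """
    categories, items = set(), set()
    for row in csv.DictReader(io.StringIO(text)):
        if row["should_import"].strip().upper() != "Y":
            continue  # rows marked N are dropped in this sketch
        category = row["category"].strip()
        if row["type"].strip() == "category":
            categories.add(category)
        else:
            items.add(f"{category}.{row['item'].strip()}")
    return categories, items


categories, items = load_spec(SPEC_TEXT)
print(sorted(categories))
print(sorted(items))
```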

Output

The script creates a new mmCIF file containing only the specified categories and items from the input file. Unless -o/--output is given, the output filename follows the pattern [input_name]_metadata.cif.

Output Format: The output file does not include a data_ block declaration line at the beginning. This allows the metadata content to be easily appended to the first data block of an existing mmCIF file. The file starts directly with the metadata categories and items.

Log File

When using the --log flag, a detailed log file is automatically generated with the same name as the output file but with a .log extension, placed in the same directory as the output file. For example:

  • If output file is input_metadata.cif, the log file will be input_metadata.log
  • If merge output is target_merged_with_input.cif, the log file will be target_merged_with_input.log (same directory as the merge output)
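The naming rules for output, merge, and log files can be sketched with pathlib. The helper names below are hypothetical, and the sketch only covers plain .cif names; how the tool derives names from .cif.V[ordinal] inputs is not described in this document.

```python
from pathlib import Path


def default_output_name(input_file):
    """Default output name: [input_name]_metadata.cif."""
    return f"{Path(input_file).stem}_metadata.cif"


def merged_output_path(target_file, input_file):
    """Merge output: <originalname>_merged_with_<inputfilename>.cif,
    placed in the same directory as the target file."""
    target = Path(target_file)
    name = f"{target.stem}_merged_with_{Path(input_file).stem}.cif"
    return target.with_name(name)


def log_path(output_file):
    """Log file: same directory and name as the output, with .log extension."""
    return Path(output_file).with_suffix(".log")


merged = merged_output_path("data/target.cif", "input.cif")
print(default_output_name("input.cif"))
print(merged)
print(log_path(merged))
```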

The log file contains:

  • Requested Categories and Items: Lists all categories and items that were requested to be imported based on the specification files
  • Skipped Specifications: Lists any specification files that were skipped (e.g., due to method mismatch) with the reason
  • Imported Categories and Items: Lists all categories and items that were successfully imported
  • Categories Not Found: Lists categories that were requested but not found in the input file
  • Items Not Found: Lists items that were requested but not found in the input file
  • Categories Not Imported (merge mode only): Lists categories that were not imported because they already exist in the target file
  • Items Not Imported (merge mode only): Lists items that were not imported because they already exist in the target file
  • Summary: Provides counts of requested vs imported categories/items, skipped specifications, categories not found, items not found, and (for merge mode) categories/items not imported

This log file is useful for debugging and understanding what metadata was imported and what was skipped.

Features

  • Supports both .cif and .cif.V[ordinal] input file extensions
  • Processes only the first data block in multi-block mmCIF files
  • Handles both single items and loop structures in mmCIF files
  • Uses CSV format for easy specification management
  • Provides detailed error messages for file reading/writing issues
  • Optional log file generation for detailed import tracking
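For readers scripting around the tool, the supported input extensions can be checked with a regular expression. This is a sketch under stated assumptions: the helper is_supported_input is hypothetical, the ordinal in .cif.V[ordinal] is assumed to be one or more digits, and matching is assumed to be case-sensitive.

```python
import re

# Matches names ending in ".cif" or ".cif.V<digits>" (assumed ordinal format).
CIF_NAME = re.compile(r"\.cif(\.V\d+)?$")


def is_supported_input(filename):
    """Return True if the file name has a supported mmCIF extension."""
    return CIF_NAME.search(filename) is not None


for name in ["model.cif", "model.cif.V3", "model.pdb"]:
    print(name, is_supported_input(name))
```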

Author & Affiliation

Deborah Harrus, Protein Data Bank in Europe (PDBe)

License

This project is licensed under the Apache License 2.0. See LICENSE for the full text.
