Validate mmCIF/CIF files against the PDBx/mmCIF dictionary or any CIF dictionary
PDBe mmCIF Validator - Python Script
Version 0.1.6
A standalone Python script to validate mmCIF/CIF files against the PDBx/mmCIF dictionary or any CIF dictionary.
Features
- ✅ Validates mmCIF/CIF files against any CIF dictionary schema
- ✅ Checks for missing mandatory items (only for categories present in file)
- ✅ Validates item values against enumerations
- ✅ Data type validation - Automatically validates types with regex patterns from dictionary (email, phone, orcid_id, pdb_id, fax, etc.) plus hardcoded validations for dates, integers, floats, booleans
- ✅ Range validation - Distinguishes between strictly allowed boundary conditions (errors) and advisory boundary conditions (warnings)
- ✅ Parent/child category relationship validation - Ensures parent categories exist when child categories are present
- ✅ Foreign key integrity validation - Ensures referenced data exists in parent items
- ✅ Composite key validation - Validates that combinations of multiple child items together match corresponding combinations in parent categories
- ✅ Operation expression validation - Parses and validates complex operation expressions like (1-60), (1,2,5), (X0)(1-5,11-15)
- ✅ Duplicate category and item detection - Reports when a category or item is duplicated (in loop or frame format)
- ✅ Supports local dictionary files or downloading from URL (works with PDBx/mmCIF dictionary or any CIF dictionary format)
- ✅ Enhanced JSON output - Includes precise character positions and column indices for programmatic error handling
- ✅ Exit codes - Returns 0 for success, 1 for errors (useful for CI/CD integration)
Installation
Prerequisites
- Python 3.7 or higher (uses only Python standard library, no pip packages required)
- Internet connection (optional) - Only needed if downloading dictionary from URL. Can use local dictionary file for offline use.
- CIF dictionary file (optional) - Defaults to PDBx/mmCIF dictionary from URL, but can use any CIF dictionary format
Usage
Basic Usage
# Dictionary source can be a file path or URL (auto-detected)
# Works with PDBx/mmCIF dictionary or any CIF dictionary format
python validate_mmcif.py <dictionary.dic or URL> <mmcif_file.cif>
Using Local Dictionary File
# Use PDBx/mmCIF dictionary
python validate_mmcif.py mmcif_pdbx_v5_next.dic 6qvt.cif
# Or use any CIF dictionary file
python validate_mmcif.py path/to/your/cif_dictionary.dic your_file.cif
Using Dictionary from URL
# Using --url option (explicit) - defaults to PDBx/mmCIF dictionary
python validate_mmcif.py --url http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic 6qvt.cif
# Or use any CIF dictionary URL
python validate_mmcif.py --url https://example.com/path/to/your/dictionary.dic 6qvt.cif
# Or as positional argument (auto-detects URL)
python validate_mmcif.py http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic 6qvt.cif
Explicit Options
# Use local file
python validate_mmcif.py --file mmcif_pdbx_v5_next.dic 6qvt.cif
# Use URL
python validate_mmcif.py --url http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic 6qvt.cif
Help
python validate_mmcif.py --help
Library usage
You can use the validator as a Python library (e.g. in prerelease pipelines or other tools) by importing and calling the same logic the CLI uses. The library raises exceptions instead of exiting, so callers can handle errors.
Install
From the project (e.g. after cloning or from a wheel):
pip install -e /path/to/mmcif-validator/vscode-extension/python-script
# or from that directory:
pip install -e .
Or install from PyPI (when published):
pip install pdbe-mmcif-validator
When installed via pip, a validate-mmcif console script is also available:
validate-mmcif --file mmcif_pdbx_v5_next.dic file.cif
validate-mmcif --help
Basic usage
from pathlib import Path
from validate_mmcif import validate, ValidatorFactory, ValidationError
from validate_mmcif import DictionaryNotFoundError, CifNotFoundError, DownloadError
# Option 1: top-level function (recommended)
try:
    errors = validate(Path("mmcif_pdbx_v5_next.dic"), Path("file.cif"))
    for err in errors:
        print(err.line, err.item, err.message, err.severity)
    if not errors:
        print("Validation passed.")
except DictionaryNotFoundError as e:
    print("Dictionary not found:", e)
except CifNotFoundError as e:
    print("mmCIF file not found:", e)
except DownloadError as e:
    print("Download failed:", e)
# Option 2: factory (same behaviour)
errors = ValidatorFactory.validate(Path("dict.dic"), Path("file.cif"))
Using a dictionary from a URL
Download the dictionary first, then validate:
from pathlib import Path
from validate_mmcif import validate, download_dictionary, DownloadError
dict_path = None
try:
    dict_path = download_dictionary("http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic")
    errors = validate(dict_path, Path("file.cif"))
    # ... use errors ...
except DownloadError as e:
    print("Download failed:", e)
finally:
    # Clean up the temp file; dict_path is still None if the download failed
    if dict_path is not None and dict_path.exists():
        dict_path.unlink()
Integrating with your logging
The module uses the standard logging logger validate_mmcif. Configure logging so library messages go to your logs:
import logging
logging.basicConfig(level=logging.DEBUG)
# or attach to your app's logger:
logging.getLogger("validate_mmcif").setLevel(logging.INFO)
Exceptions
| Exception | When it is raised |
|---|---|
| DictionaryNotFoundError | The dictionary path does not exist. |
| CifNotFoundError | The mmCIF file path does not exist. |
| DownloadError | Downloading the dictionary from a URL failed. |
All of these inherit from MmCIFValidatorError, so you can catch that for any validator error.
Return value
validate() and ValidatorFactory.validate() return a list of ValidationError dataclass instances with:
- line, item, message, severity ("error" or "warning")
- column, start_char, end_char (optional, for positioning)
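As a rough illustration of working with the returned list, the sketch below splits issues by severity and decides whether a pipeline step should fail. It does not import the real package: Issue is a hypothetical stand-in dataclass that only mirrors the field names documented above, and treating warnings as non-fatal is an assumption, not documented behaviour.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Issue:
    """Hypothetical stand-in for ValidationError (same documented field names)."""
    line: int
    item: str
    message: str
    severity: str  # "error" or "warning"
    column: Optional[int] = None
    start_char: Optional[int] = None
    end_char: Optional[int] = None


def split_by_severity(issues: List[Issue]) -> Tuple[List[Issue], List[Issue]]:
    """Return (errors, warnings), each sorted by line number."""
    errors = sorted((i for i in issues if i.severity == "error"), key=lambda i: i.line)
    warnings = sorted((i for i in issues if i.severity == "warning"), key=lambda i: i.line)
    return errors, warnings


def should_fail(issues: List[Issue]) -> bool:
    """Assumption: only issues with severity "error" fail the step."""
    return any(i.severity == "error" for i in issues)


# Sample data shaped like the example output in this README.
issues = [
    Issue(1020, "_refine.ls_R_factor_obs",
          "Value '1.250' is above maximum allowed value '1.000'", "error"),
    Issue(1011, "_refine.ls_R_factor_obs",
          "Out of advisory range: Value '0.350' is above advisory maximum '0.300'", "warning"),
    Issue(36, "_pdbx_database_status.recvd_initial_deposition_date",
          "Value '20250601' does not match expected type 'yyyy-mm-dd'", "error"),
]

errors, warnings = split_by_severity(issues)
for issue in errors:
    print(f"{issue.line}: {issue.item}: {issue.message}")
```

With the real library, the same two helpers should work on the list returned by validate(), provided the attribute names match those listed above.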
Output
The script outputs:
- Validation errors and warnings with line numbers
- JSON output for programmatic use
- Exit code 0 for success, 1 for errors
Example output:
Parsing dictionary: mmcif_pdbx.dic
Loaded 6652 items from dictionary
Parsing mmCIF file: model.cif
Found 1124 items in mmCIF file
Validating...
Found 4 validation issue(s):
ERROR: Line 36, Item '_pdbx_database_status.recvd_initial_deposition_date'
Value '20250601' does not match expected type 'yyyy-mm-dd'
ERROR: Line 1643, Item '_pdbx_struct_assembly_gen.oper_expression'
Operation expression '1' references operation ID '1' which does not exist in '_pdbx_struct_oper_list.id'
WARNING: Line 1011, Item '_refine.ls_R_factor_obs'
Out of advisory range: Value '0.350' is above advisory maximum '0.300'
ERROR: Line 1020, Item '_refine.ls_R_factor_obs'
Value '1.250' is above maximum allowed value '1.000'
JSON Output Format
The script outputs JSON at the end with the following structure:
{
  "errors": [
    {
      "line": 1238,
      "item": "_refine_ls_shell.number_reflns_R_free",
      "message": "Out of advisory range: Value '0' is below minimum advised value '1'",
      "severity": "warning",
      "column": 5,
      "start_char": 43,
      "end_char": 44
    }
  ]
}
Fields:
- line: Line number (1-based) where the error occurs
- item: The item name (e.g., _refine_ls_shell.number_reflns_R_free)
- message: Human-readable error message
- severity: Either "error" or "warning"
- column: Global column index (0-based) within the row (for loop data) or null for non-loop items
- start_char: Character start position (0-based) within the line for precise highlighting, or null if not available
- end_char: Character end position (0-based) within the line for precise highlighting, or null if not available
The start_char and end_char fields enable precise highlighting of the exact problematic value, even when the same value appears multiple times on a line.
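A minimal sketch of how a consumer might use these fields is shown below. It parses the documented JSON structure and draws a caret marker under the reported span. The source line is invented for illustration, the underline helper is hypothetical, and end_char is treated as exclusive, which is inferred from the example above (a one-character value with start_char 43 and end_char 44) rather than stated by the project.

```python
import json

# Payload copied from the JSON structure documented above.
payload = """
{
  "errors": [
    {
      "line": 1238,
      "item": "_refine_ls_shell.number_reflns_R_free",
      "message": "Out of advisory range: Value '0' is below minimum advised value '1'",
      "severity": "warning",
      "column": 5,
      "start_char": 43,
      "end_char": 44
    }
  ]
}
"""


def underline(source_line, start_char, end_char):
    """Return a marker line with carets under source_line[start_char:end_char].

    Falls back to marking the whole line when either position is None (JSON null).
    """
    if start_char is None or end_char is None:
        return "^" * len(source_line)
    return " " * start_char + "^" * max(end_char - start_char, 1)


report = json.loads(payload)
issue = report["errors"][0]

# Invented loop row, padded so that the offending value '0' sits at index 43.
source_line = "'X-RAY DIFFRACTION' 1.90 2.00 1234 0.25 ".ljust(43) + "0"

print(f"{issue['severity'].upper()} line {issue['line']}: {issue['message']}")
print(source_line)
print(underline(source_line, issue["start_char"], issue["end_char"]))
```

An editor integration would typically map the same three numbers (line, start_char, end_char) onto its own range type instead of printing carets.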
Validation Checks
The validator performs the following checks:
- Item Definition: Verifies that items used in the mmCIF file are defined in the dictionary
- Mandatory Items: Checks that all mandatory items are present (only for categories that exist in the file)
- Enumeration Values: Validates that item values match allowed enumerations (reported as errors)
  - Handles enumerations with only _item_enumeration.value (no detail field)
  - Handles enumerations with both value and detail fields
- Data Type Validation: Validates that values match their expected data types:
  - Regex patterns from dictionary - Automatically validates any type code that has a regex pattern defined in _item_type_list.construct (e.g., email, phone, orcid_id, pdb_id, fax, etc.)
  - Hardcoded validations for common types:
    - Date formats: yyyy-mm-dd, yyyy-mm-dd:hh:mm, yyyy-mm-dd:hh:mm-flex
    - Numeric types: int, positive_int, float, float-range
    - Boolean type: boolean
- Range Validation: Checks that numeric values fall within specified minimum/maximum ranges
  - Strictly Allowed Boundary Conditions (_item_range): Violations are reported as errors
  - Advisory Boundary Conditions (_pdbx_item_range): Violations are reported as warnings with "Out of advisory range:" prefix
- Parent/Child Category Validation:
  - Verifies that when a child category is present, its parent category is also present
  - Example: If entity_src_nat (child) is present, entity (parent) must also be present
- Foreign Key Integrity: Validates that foreign key values in child items exist in their parent items
  - Example: _entity_src_nat.entity_id values must exist in _entity.id
- Composite Key Validation: Validates that combinations of multiple child items together match corresponding combinations in parent categories
  - Example: In pdbx_entity_poly_domain, the combination of begin_mon_id + begin_seq_num must match a row in entity_poly_seq where mon_id + num appear together as a pair
  - Validates relationships where multiple items form a composite foreign key (identified by link_group_id in the dictionary)
  - Special handling for label/auth field combinations: Categories like struct_conn, pdbx_struct_conn_angle, geom_*, atom_site_anisotrop, and others have composite keys that include both label and auth fields. The validator intelligently handles these by:
    - First attempting validation using label fields (if complete)
    - Falling back to auth fields when label fields are incomplete (e.g., when label_seq_id is missing)
    - Using label_atom_id when auth_atom_id is not present in the file
    - This ensures atoms referenced in these categories are properly validated against atom_site even when some fields are missing
- Operation Expression Validation: Validates oper_expression values that reference operation IDs
  - Parses complex operation expressions: (1), (1,2,5), (1-4), (1,2)(3,4), (X0)(1-5,11-15)
  - Validates that all referenced operation IDs exist in _pdbx_struct_oper_list.id
  - Example: If oper_expression is (1-60), validates that operation IDs 1 through 60 all exist
- Category-aware validation: Only checks mandatory items for categories that are actually present in the mmCIF file
- First data block only: By default, only validates the first data block in files containing multiple data blocks (each starting with data_)
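To make the operation expression check more concrete, here is a minimal sketch of how such expressions could be expanded into the operation IDs they reference. It is not the validator's own parser: the function names are hypothetical, and the sketch assumes that every parenthesised group contributes its IDs and that a bare value such as 1 is a single ID.

```python
import re
from typing import Iterable, List


def expand_group(group: str) -> List[str]:
    """Expand one comma-separated group such as '1-5,11-15' into individual IDs."""
    ids: List[str] = []
    for part in group.split(","):
        part = part.strip()
        match = re.fullmatch(r"(\d+)-(\d+)", part)
        if match:
            low, high = int(match.group(1)), int(match.group(2))
            ids.extend(str(n) for n in range(low, high + 1))
        elif part:
            ids.append(part)  # plain ID, numeric or not (e.g. 'X0')
    return ids


def referenced_ids(expression: str) -> List[str]:
    """Return every operation ID referenced by an expression, in first-seen order."""
    groups = re.findall(r"\(([^()]*)\)", expression)
    if not groups:
        groups = [expression]  # bare form without parentheses, e.g. '1'
    seen: List[str] = []
    for group in groups:
        for op_id in expand_group(group):
            if op_id not in seen:
                seen.append(op_id)
    return seen


def missing_ids(expression: str, known_ids: Iterable[str]) -> List[str]:
    """IDs referenced by the expression that are absent from known_ids."""
    known = set(known_ids)
    return [op_id for op_id in referenced_ids(expression) if op_id not in known]


# known_ids stands in for the values of _pdbx_struct_oper_list.id
print(missing_ids("(1,2,5)", ["1", "2"]))
```

Keeping IDs as strings avoids having to decide how non-numeric identifiers such as X0 compare with numeric ones.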
Error vs Warning Severity
The validator reports issues with different severity levels:
Errors (Red Underline)
These are violations of mandatory constraints that must be fixed:
- Missing Mandatory Items: Required items that are missing from categories present in the file
- Enumeration Violations: Values that don't match the controlled vocabulary/enumeration list
- Data Type Mismatches: Values that don't match their expected data type (e.g., invalid date format, non-numeric value for integer type)
- Strictly Allowed Range Violations (_item_range): Values outside the strictly allowed boundary conditions
- Parent Category Missing: Child categories present but their required parent categories are missing
- Foreign Key Integrity Violations: Foreign key values that don't exist in their parent items
- Composite Key Violations: Combinations of multiple child items that don't match corresponding combinations in parent categories (including label/auth field combinations)
- Invalid Operation Expression References: Operation expressions referencing operation IDs that don't exist
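The foreign key check in the list above can be pictured as a set membership test. The following is a simplified sketch, not the validator's code: dangling_references is a hypothetical helper, and skipping the CIF placeholder values "." and "?" is an assumption about how unset values would be treated.

```python
from typing import Iterable, List, Tuple

# CIF placeholders for "not applicable" and "unknown"; skipped here by assumption.
CIF_NULLS = {".", "?"}


def dangling_references(child_values: Iterable[str],
                        parent_values: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (row_index, value) for each child value with no matching parent value."""
    parents = set(parent_values)
    return [
        (row, value)
        for row, value in enumerate(child_values)
        if value not in CIF_NULLS and value not in parents
    ]


entity_ids = ["1", "2"]             # values of _entity.id
src_nat_entity_ids = ["1", "3", "?"]  # values of _entity_src_nat.entity_id

print(dangling_references(src_nat_entity_ids, entity_ids))
```

Each returned pair identifies the row to report, which is roughly the information needed to attach a line number to the error.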
Warnings (Yellow Underline)
These are advisory issues that may indicate problems but are not strictly required:
- Undefined Items: Items used in the mmCIF file that are not defined in the dictionary (only for items not starting with _)
- Advisory Range Violations (_pdbx_item_range): Values outside the advisory boundary conditions (but within allowed range)
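The split between range errors and range warnings can be sketched as a two-stage comparison. This is an illustration only: check_range is a hypothetical helper, bounds are treated as inclusive, and a missing bound is represented as None. The dictionary's real range encoding can be more involved than a single (minimum, maximum) pair.

```python
from typing import Optional, Tuple

Bounds = Tuple[Optional[float], Optional[float]]  # (minimum, maximum); None means unbounded


def outside(value: float, bounds: Bounds) -> bool:
    """True when value lies outside the inclusive bounds."""
    low, high = bounds
    return (low is not None and value < low) or (high is not None and value > high)


def check_range(value: float, allowed: Bounds, advisory: Bounds) -> Optional[str]:
    """Return 'error', 'warning', or None for a value that passes both checks."""
    if outside(value, allowed):    # strict limits, as in _item_range
        return "error"
    if outside(value, advisory):   # advisory limits, as in _pdbx_item_range
        return "warning"
    return None


# Maxima taken from the example output for _refine.ls_R_factor_obs;
# the lower bounds are left open because the example does not show them.
allowed = (None, 1.0)
advisory = (None, 0.3)

for value in (0.2, 0.35, 1.25):
    print(value, check_range(value, allowed, advisory))
```

Checking the strict limits first means a value outside both ranges is reported once, as an error, rather than as an error plus a warning.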
Command-Line Options
- --file, -f: Path to local dictionary file (.dic)
- --url, -u: URL to download dictionary from
- Positional arguments: Dictionary source (auto-detects file path or URL) and mmCIF file
Examples
# Validate with local dictionary
python validate_mmcif.py mmcif_pdbx_v5_next.dic 6qvt.cif
# Validate with online dictionary
python validate_mmcif.py --url http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic 6qvt.cif
# Output to file
python validate_mmcif.py mmcif_pdbx_v5_next.dic 6qvt.cif > validation_results.txt
Troubleshooting
Dictionary file not found
- Check that the file path is correct
- Use absolute paths if relative paths don't work
- Or use the --url option to download from the internet
Validation script errors
- Ensure Python 3.7+ is installed: python --version
- Check file paths are correct
- Verify dictionary file format is correct
- For large files, validation may take time. When using the VSCode extension, the validation timeout is configurable in settings (default 60 seconds, max 600); increase mmcifValidator.validationTimeoutSeconds if you see "Validation timed out".
Python not found
- Make sure Python is in your PATH
- Or use the full path to your Python executable, e.g. python3 validate_mmcif.py ... or on Windows C:\Python39\python.exe validate_mmcif.py ...
Limitations
- Dictionary parsing is simplified and may not handle all dictionary features
- Large dictionary files may take time to parse
- Some advanced validation rules may not be implemented yet
- Note: Some "missing mandatory item" errors may be false positives. In the mmCIF dictionary, items are often mandatory only when their parent category is present. The current implementation checks mandatory items only for categories that exist in the file, which should reduce false positives.
- Note: Some foreign key validation errors may be false positives if relationships are optional or conditional. The validator checks all defined parent/child relationships from _pdbx_item_linked_group_list.
- Note: For categories with both label and auth fields (like struct_conn), the validator will attempt to validate using label fields first, then fall back to auth fields if label fields are incomplete. This ensures proper validation even when some fields are missing (e.g., when label_seq_id is "." for non-polymer entities).
- Data type validation uses regex patterns from the dictionary when available. Types like email, phone, orcid_id, pdb_id, etc. are automatically validated if they have regex patterns defined in _item_type_list.construct. Types without regex patterns fall back to hardcoded validation (dates, int, positive_int, float, float-range, boolean) or are accepted without format validation.
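The lookup order described in the last point (dictionary regex first, then a hardcoded check, then accept) could look roughly like the sketch below. All names are hypothetical, only two hardcoded types are shown, and the pdb_id pattern is an illustrative placeholder rather than the pattern from the PDBx/mmCIF dictionary. Patterns taken from a real dictionary may also need adapting before Python's re module accepts them.

```python
import re
from datetime import datetime
from typing import Callable, Dict


def is_valid_date(value: str) -> bool:
    """Accept only zero-padded yyyy-mm-dd values that are real calendar dates."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is None:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def is_valid_int(value: str) -> bool:
    return re.fullmatch(r"[+-]?\d+", value) is not None


# Hardcoded fallbacks, keyed by dictionary type code (subset for illustration).
HARDCODED: Dict[str, Callable[[str], bool]] = {
    "yyyy-mm-dd": is_valid_date,
    "int": is_valid_int,
}


def check_type(value: str, type_code: str, dictionary_patterns: Dict[str, str]) -> bool:
    """Validate value against a dictionary regex, a hardcoded check, or accept it."""
    pattern = dictionary_patterns.get(type_code)
    if pattern is not None:
        return re.fullmatch(pattern, value) is not None
    checker = HARDCODED.get(type_code)
    if checker is not None:
        return checker(value)
    return True  # no rule known for this type code: accepted without format validation


# Placeholder pattern for illustration only, not copied from any dictionary.
patterns = {"pdb_id": r"[1-9][A-Za-z0-9]{3}"}

print(check_type("20250601", "yyyy-mm-dd", patterns))
print(check_type("6qvt", "pdb_id", patterns))
```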
Implemented Features
The validator currently implements comprehensive validation including:
- Data type validation - Validates int, float, date, etc., plus automatic validation of any type with regex pattern in dictionary (email, phone, orcid_id, pdb_id, fax, etc.)
- Range validation - Checks min/max values from dictionary constraints
- Parent/child category validation - Validates category hierarchies
- Foreign key integrity validation - Ensures referenced data exists
- Composite key validation - Validates that combinations of multiple child items match corresponding combinations in parent categories
- Label/auth field composite key handling - Special validation for categories with both label and auth fields (struct_conn, geom_*, etc.) that intelligently falls back between label and auth fields when some values are missing
- Operation expression validation - Parses and validates complex oper_expression values
- Regular expression validation - Automatically extracts and uses regex patterns from dictionary for type validation
- Conditional relationship validation - Entity type-based validation for polymer/non-polymer relationships
- Loop structure parsing - Parses and validates loop structures in mmCIF files
- Category key extraction - Extracts and uses category keys from dictionary definitions
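As an illustration of the composite key handling with label/auth fallback listed above, the sketch below tries a label-based key first and falls back to an auth-based key when the label key is incomplete. It is a simplified model, not the validator's implementation: the helper names, the choice of three fields per key, and the sample rows are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, str]
Pairs = Sequence[Tuple[str, str]]  # (child field, parent field)
NULLS = {".", "?", ""}


def key_of(row: Row, fields: Sequence[str]) -> Optional[Tuple[str, ...]]:
    """Build a composite key, or None if any component is missing or a CIF placeholder."""
    values = tuple(row.get(name, "") for name in fields)
    return None if any(v in NULLS for v in values) else values


def unmatched_rows(child_rows: List[Row], parent_rows: List[Row],
                   label_pairs: Pairs, auth_pairs: Pairs) -> List[int]:
    """Indices of child rows whose first complete composite key has no parent match."""
    bad: List[int] = []
    for index, child in enumerate(child_rows):
        for pairs in (label_pairs, auth_pairs):  # label fields first, then auth
            key = key_of(child, [c for c, _ in pairs])
            if key is None:
                continue  # incomplete key: try the next field set
            parent_keys = {key_of(p, [f for _, f in pairs]) for p in parent_rows}
            if key not in parent_keys:
                bad.append(index)
            break  # validated with this field set; do not fall through
    return bad


label_pairs = [("ptnr1_label_asym_id", "label_asym_id"),
               ("ptnr1_label_seq_id", "label_seq_id"),
               ("ptnr1_label_atom_id", "label_atom_id")]
auth_pairs = [("ptnr1_auth_asym_id", "auth_asym_id"),
              ("ptnr1_auth_seq_id", "auth_seq_id"),
              ("ptnr1_label_atom_id", "label_atom_id")]

atom_site = [
    {"label_asym_id": "A", "label_seq_id": "10", "label_atom_id": "SG",
     "auth_asym_id": "A", "auth_seq_id": "15"},
    {"label_asym_id": "B", "label_seq_id": ".", "label_atom_id": "ZN",
     "auth_asym_id": "B", "auth_seq_id": "201"},
]
struct_conn = [
    {"ptnr1_label_asym_id": "A", "ptnr1_label_seq_id": "10", "ptnr1_label_atom_id": "SG",
     "ptnr1_auth_asym_id": "A", "ptnr1_auth_seq_id": "15"},
    {"ptnr1_label_asym_id": "B", "ptnr1_label_seq_id": ".", "ptnr1_label_atom_id": "ZN",
     "ptnr1_auth_asym_id": "B", "ptnr1_auth_seq_id": "201"},
    {"ptnr1_label_asym_id": "B", "ptnr1_label_seq_id": ".", "ptnr1_label_atom_id": "ZN",
     "ptnr1_auth_asym_id": "B", "ptnr1_auth_seq_id": "999"},
]

print(unmatched_rows(struct_conn, atom_site, label_pairs, auth_pairs))
```

Rows for which neither key is complete are skipped in this sketch; whether the validator reports or skips them is not specified here.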
License
MIT
Author
Deborah Harrus, Protein Data Bank in Europe (PDBe), EMBL-EBI
Related
This script is part of the PDBe mmCIF Validator project, which also includes a Visual Studio Code extension for real-time validation.