
cifflow

Parse, store, validate, and emit Crystallographic Information Files (CIF).

Python ≥ 3.10 · Apache 2.0 · v0.1.2 · pre-release (not yet on PyPI)


What it does

  • Parses CIF 1.1 and CIF 2.0 files, including all string types (triple-quoted, multiline text fields, embedded quotes) and save frames
  • Loads DDLm dictionaries with full _import.get resolution, producing a typed schema
  • Focused on multi-block powder CIF files
  • Ingests parsed CIF data into DuckDB using the dictionary-derived schema: one table per category, foreign keys enforced, unknown tags routed to a fallback tier
  • Emits valid CIF from a populated database in four modes: ORIGINAL, GROUPED, ONE_BLOCK, ALL_BLOCKS
  • Trusts the user: if you pass in multiple blocks, the program assumes they all belong together and, barring key-value clashes, can be interpreted as a single database/experiment
  • Constructs CifFile objects programmatically from Python values (CifWriter), and performs arbitrary edits: add/remove/rename tags, loops, blocks, and save frames
  • Removes common parse-time artefacts automatically (clean): orphan error tags, duplicate blocks/save frames/tags, loop padding; for anything beyond these automatic fixes, use CifWriter
  • Visualises a schema as a Graphviz DOT string or a self-contained interactive HTML file
  • Returns data as Apache Arrow RecordBatch objects directly from the Rust parser (build_arrow, build_arrow_file)

Key properties

Error-tolerant. The parser never raises on malformed input. Every structural problem produces an explicit error event; parsing continues and all recoverable data is preserved.

No silent data loss. Duplicate tag values are preserved. Tags not mapped by the dictionary go to a fallback table, not a discard pile.

Round-trip fidelity. For well-formed input, emitted CIF re-parses to the same data. All values are stored and emitted as raw strings; ValueType provenance (placeholder . and ? vs quoted equivalents) is preserved throughout.

Canonical caseless names. Block names, save frame names, and tag names are stored in Unicode canonical caseless form (NFC(casefold(NFD(x)))). Lookups are automatically casefolded: cif["ABC"] finds a block stored as "abc".
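
A minimal sketch of that normalisation using only the standard library (the helper name is illustrative, not part of the cifflow API):

import unicodedata

def canonical_caseless(name: str) -> str:
    # NFC(casefold(NFD(x))), as described above
    return unicodedata.normalize('NFC', unicodedata.normalize('NFD', name).casefold())

assert canonical_caseless('ABC') == canonical_caseless('abc')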

Streaming parser. The parser is event-driven. CIF source is consumed in a single pass; the IR accumulates events incrementally. The Rust extension provides high-throughput Arrow output without any Python file objects.
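
As a sketch only, assuming build_arrow takes the CIF text and returns Arrow record batches alongside parse errors in the same style as build (check docs/api.md for the actual signature):

from cifflow import build_arrow

batches, errors = build_arrow(open('structure.cif', encoding='utf-8').read())  # assumed return shape
for batch in batches:                          # pyarrow.RecordBatch objects
    print(batch.num_rows, batch.schema.names)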


Installation

cifflow is not yet on PyPI. Install directly from source:

git clone https://github.com/rowlesmr/cifflow.git
cd cifflow
pip install -e ".[dev]"
.venv/Scripts/maturin develop   # or: maturin develop, if maturin is on PATH

duckdb and pyarrow are declared dependencies and are installed automatically. The Rust extension (cifflow_core) is compiled by maturin in the final (maturin develop) step.
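
To confirm the editable install worked, a basic import check is enough (these are the same public names used in the examples below):

from cifflow import build, ingest, emit   # basic import check after installing from source
print('cifflow imported successfully')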


Quick start

Parse a CIF file

from cifflow import build

text = open('structure.cif', encoding='utf-8').read()
cif, errors = build(text)   # never raises; errors is a list[ParseError]

for block_name in cif.blocks:          # block names are always lowercase
    block = cif[block_name]
    print(f'{block_name}: {len(block.tags)} tags, {len(block.loops)} loops')

The best way to resolve errors is to inspect the returned list, edit the file accordingly, and parse again. cifflow makes no assumptions about how to correct errors automatically.
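
For example (a minimal sketch; the ParseError objects are simply printed here rather than interpreted):

if errors:
    for err in errors:        # each ParseError describes one structural problem found while parsing
        print(err)
else:
    print('parsed cleanly')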

Full pipeline: dictionary → DuckDB → CIF

import pathlib
from cifflow import (
    DictionaryLoader, directory_resolver,
    save_dictionary, load_dictionary,
    generate_schema,
    build, ingest, emit, EmitMode,
)
from cifflow.types import CifVersion

# 1. Load dictionary (with JSON cache to avoid re-parsing on every run)
cache = pathlib.Path('cif_pow_cache.json')
resolver = directory_resolver('data/dictionaries')
if cache.exists():
    dictionary = load_dictionary(cache)
else:
    dictionary = DictionaryLoader(resolver=resolver).load(
        open('data/dictionaries/cif_pow.dic', encoding='utf-8').read())
    save_dictionary(dictionary, cache)

# 2. Derive schema
schema = generate_schema(dictionary)

# 3. Parse CIF
cif, errors = build(open('all_the_data.cif', encoding='utf-8').read())

# 4. Ingest into an in-memory DuckDB database
#    Pass a file path string to persist: ingest(cif, 'output.db', schema=schema)
conn, warnings = ingest(cif, schema=schema)

# 5. Emit CIF
output = emit(conn, schema, mode=EmitMode.ORIGINAL, version=CifVersion.CIF_2_0)
open('output.cif', 'w', encoding='utf-8').write(output)
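
Since ingest returns a DuckDB connection, the ingested tables can also be inspected directly with SQL. A sketch, assuming conn behaves like a standard duckdb connection (table names depend on the dictionary-derived schema):

# 6. Optional: list the tables created during ingestion
print(conn.execute('SHOW TABLES').fetchall())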

See example_workflow.py in the repository root for a fully annotated end-to-end demonstration covering all four emission modes, type-cast export, and fidelity checking.

The full API reference is in docs/api.md.


Architecture

Parser → Event Stream → IR → Dictionary-aware Mapping → DuckDB → Output/API
Layer          Responsibility
Lexer          Tokenisation, ValueType assignment
Parser         Token sequence interpretation, error recovery, event emission
IR (CIFModel)  Event accumulation, loop validation, multiline text transformation
Dictionary     DDLm parsing, schema derivation
DuckDB         Persistent storage: structured tables when a dictionary is present, fallback tier otherwise
Output         Valid CIF regeneration; Python/NumPy/pandas API surface

Layer responsibilities are strictly separated. The parser does not know about the dictionary. The dictionary does not know about the IR. The output layer only reads from DuckDB.


Status

All stages are complete and tested.

Stage  Feature
1–2    CIF 1.1 and 2.0 parser + IR (CIF model)
3      DDLm dictionary loading (_import.get, alias resolution, deprecation)
4      DuckDB schema generation (Set/Loop → tables, PKs, FKs, bridge columns, fallback tier)
5      DuckDB ingestion: structured tables + fallback tier; FK propagation; error recovery; canonical caseless name matching
6      CIF emission (ORIGINAL, GROUPED, ONE_BLOCK, ALL_BLOCKS); pretty-print; line-length enforcement; decimal alignment; schema visualisation; programmatic CifFile construction (CifWriter); cleaning parser artefacts (clean); type-cast export (convert_database); fidelity checking (check_fidelity); validation (validate)

Development

Run the fast test suite (excludes tests that load large real-world CIF files):

.venv/Scripts/python.exe -m pytest -m "not slow"

Run the full suite including slow tests:

.venv/Scripts/python.exe -m pytest

After modifying the Rust extension, recompile before running Python tests:

.venv/Scripts/maturin develop

License

Apache 2.0. See LICENSE.

The bundled JavaScript files (viz.js 2.1.2 and svg-pan-zoom 3.6.1) used by visualise_schema_html are MIT-licensed. Licence notices are in src/cifflow/dictionary/js/LICENSES.txt.
