
cifflow

Parse, store, validate, and emit Crystallographic Information Files (CIF).

Python ≥ 3.10 · Apache 2.0 · v0.1.2 · pre-release (not yet on PyPI)


What it does

  • Parses CIF 1.1 and CIF 2.0 files, including all string types (triple-quoted, multiline text fields, embedded quotes) and save frames
  • Loads DDLm dictionaries with full _import.get resolution, producing a typed schema
  • Focused on multi-block powder CIF files
  • Ingests parsed CIF data into DuckDB using the dictionary-derived schema: one table per category, foreign keys enforced, unknown tags routed to a fallback tier
  • Emits valid CIF from a populated database in four modes: ORIGINAL, GROUPED, ONE_BLOCK, ALL_BLOCKS
  • Trusts the user: if you pass in multiple blocks, the program assumes they all belong together and, barring key-value clashes, can be interpreted as a single database/experiment
  • Constructs CifFile objects programmatically from Python values (CifWriter), and performs arbitrary edits: add/remove/rename tags, loops, blocks, and save frames
  • Removes common parse-time artefacts automatically (clean): orphan error tags, duplicate blocks/save frames/tags, loop padding; for anything beyond these automatic fixes, use CifWriter
  • Visualises a schema as a Graphviz DOT string or a self-contained interactive HTML file
  • Returns data as Apache Arrow RecordBatch objects directly from the Rust parser (build_arrow, build_arrow_file)

Key properties

Error-tolerant. The parser never raises on malformed input. Every structural problem produces an explicit error event; parsing continues and all recoverable data is preserved.

No silent data loss. Duplicate tag values are preserved. Tags not mapped by the dictionary go to a fallback table, not a discard pile.

Round-trip fidelity. For well-formed input, emitted CIF re-parses to the same data. All values are stored and emitted as raw strings; ValueType provenance (placeholder . and ? vs quoted equivalents) is preserved throughout.

Canonical caseless names. Block names, save frame names, and tag names are stored in Unicode canonical caseless form (NFC(casefold(NFD(x)))). Lookups are automatically casefolded: cif["ABC"] finds a block stored as "abc".
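
A minimal sketch of that normalisation using only the standard library (the helper name is illustrative, not part of the cifflow API):

import unicodedata

def canonical_caseless(name: str) -> str:
    # NFC(casefold(NFD(x))), as described above
    return unicodedata.normalize('NFC', unicodedata.normalize('NFD', name).casefold())

assert canonical_caseless('ABC') == canonical_caseless('abc')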

Streaming parser. The parser is event-driven. CIF source is consumed in a single pass; the IR accumulates events incrementally. The Rust extension provides high-throughput Arrow output without any Python file objects.
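
As a sketch only, assuming build_arrow takes the CIF text and returns Arrow record batches alongside parse errors in the same style as build (check docs/api.md for the actual signature):

from cifflow import build_arrow

batches, errors = build_arrow(open('structure.cif', encoding='utf-8').read())  # assumed return shape
for batch in batches:                          # pyarrow.RecordBatch objects
    print(batch.num_rows, batch.schema.names)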


Installation

cifflow is not yet on PyPI. Install directly from source:

git clone https://github.com/rowlesmr/cifflow.git
cd cifflow
pip install -e ".[dev]"
.venv/Scripts/maturin develop   # or: maturin develop, if maturin is on PATH

duckdb and pyarrow are declared dependencies and are installed automatically. The Rust extension (cifflow_core) is compiled by maturin in the final (maturin develop) step.
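
To confirm the editable install worked, a basic import check is enough (these are the same public names used in the examples below):

from cifflow import build, ingest, emit   # basic import check after installing from source
print('cifflow imported successfully')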


Quick start

Parse a CIF file

from cifflow import build

text = open('structure.cif', encoding='utf-8').read()
cif, errors = build(text)   # never raises; errors is a list[ParseError]

for block_name in cif.blocks:          # block names are always lowercase
    block = cif[block_name]
    print(f'{block_name}: {len(block.tags)} tags, {len(block.loops)} loops')

The best way to resolve errors is to inspect the returned list, edit the file accordingly, and parse again. cifflow makes no assumptions about how to correct errors automatically.
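
For example (a minimal sketch; the ParseError objects are simply printed here rather than interpreted):

if errors:
    for err in errors:        # each ParseError describes one structural problem found while parsing
        print(err)
else:
    print('parsed cleanly')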

Full pipeline: dictionary → DuckDB → CIF

import pathlib
from cifflow import (
    DictionaryLoader, directory_resolver,
    save_dictionary, load_dictionary,
    generate_schema,
    build, ingest, emit, EmitMode,
)
from cifflow.types import CifVersion

# 1. Load dictionary (with JSON cache to avoid re-parsing on every run)
cache = pathlib.Path('cif_pow_cache.json')
resolver = directory_resolver('data/dictionaries')
if cache.exists():
    dictionary = load_dictionary(cache)
else:
    dictionary = DictionaryLoader(resolver=resolver).load(
        open('data/dictionaries/cif_pow.dic', encoding='utf-8').read())
    save_dictionary(dictionary, cache)

# 2. Derive schema
schema = generate_schema(dictionary)

# 3. Parse CIF
cif, errors = build(open('all_the_data.cif', encoding='utf-8').read())

# 4. Ingest into an in-memory DuckDB database
#    Pass a file path string to persist: ingest(cif, 'output.db', schema=schema)
conn, warnings = ingest(cif, schema=schema)

# 5. Emit CIF
output = emit(conn, schema, mode=EmitMode.ORIGINAL, version=CifVersion.CIF_2_0)
open('output.cif', 'w', encoding='utf-8').write(output)
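
Since ingest returns a DuckDB connection, the ingested tables can also be inspected directly with SQL. A sketch, assuming conn behaves like a standard duckdb connection (table names depend on the dictionary-derived schema):

# 6. Optional: list the tables created during ingestion
print(conn.execute('SHOW TABLES').fetchall())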

See example_workflow.py in the repository root for a fully annotated end-to-end demonstration covering all four emission modes, type-cast export, and fidelity checking.

The full API reference is in docs/api.md.


Architecture

Parser → Event Stream → IR → Dictionary-aware Mapping → DuckDB → Output/API
Layer          Responsibility
Lexer          Tokenisation, ValueType assignment
Parser         Token sequence interpretation, error recovery, event emission
IR (CIFModel)  Event accumulation, loop validation, multiline text transformation
Dictionary     DDLm parsing, schema derivation
DuckDB         Persistent storage: structured tables when a dictionary is present, fallback tier otherwise
Output         Valid CIF regeneration; Python/NumPy/pandas API surface

Layer responsibilities are strictly separated. The parser does not know about the dictionary. The dictionary does not know about the IR. The output layer only reads from DuckDB.


Status

All stages are complete and tested.

Stage  Feature
1–2    CIF 1.1 and 2.0 parser + IR (CIF model)
3      DDLm dictionary loading (_import.get, alias resolution, deprecation)
4      DuckDB schema generation (Set/Loop → tables, PKs, FKs, bridge columns, fallback tier)
5      DuckDB ingestion: structured tables + fallback tier; FK propagation; error recovery; canonical caseless name matching
6      CIF emission (ORIGINAL, GROUPED, ONE_BLOCK, ALL_BLOCKS); pretty-print; line-length enforcement; decimal alignment; schema visualisation; programmatic CifFile construction (CifWriter); cleaning parser artefacts (clean); type-cast export (convert_database); fidelity checking (check_fidelity); validation (validate)

Development

Run the fast test suite (excludes tests that load large real-world CIF files):

.venv/Scripts/python.exe -m pytest -m "not slow"

Run the full suite including slow tests:

.venv/Scripts/python.exe -m pytest

After modifying the Rust extension, recompile before running Python tests:

.venv/Scripts/maturin develop

License

Apache 2.0. See LICENSE.

The bundled JavaScript files (viz.js 2.1.2 and svg-pan-zoom 3.6.1) used by visualise_schema_html are MIT-licensed. Licence notices are in src/cifflow/dictionary/js/LICENSES.txt.
