Skip to main content

Recover RDKit molecules, charges, SMILES, InChI, and InChIKey from XYZ coordinates with candidate charge inference.

Project description

xyzrecover

xyzrecover is a small RDKit-based package for recovering molecular identifiers from XYZ coordinate files when the bond graph and charge were lost.

It is designed for batches of XYZ files that may contain one molecule, multiple disconnected covalent molecules, or multi-frame XYZ blocks. It uses RDKit's native rdDetermineBonds / xyz2mol implementation to infer connectivity and bond orders, then emits canonical SMILES, explicit-H SMILES, formal charge, formula, InChI, and InChIKey.

Important limitation

XYZ files do not contain bond orders, formal charges, spin state, resonance choice, protonation intent, or non-covalent association metadata. For some geometries, more than one chemically valid assignment is possible. xyzrecover therefore searches candidate charges, scores the successful assignments, and includes alternates in the output instead of pretending the answer is always unique.

When you know the total charge, pass it. That is the single most useful constraint.

RDKit's valence model only handles main-group organic elements. Components containing metals or noble gases are reported with status="unsupported" and smiles=None rather than perceived, because RDKit would otherwise return a silently wrong structure at high confidence (a lone Na⁺ comes back as [NaH]). The supported set is DEFAULT_SUPPORTED_ELEMENTS; override it via PerceptionConfig(supported_elements=...), or disable the check with restrict_to_supported_elements=False (CLI: --no-element-check).

The one exception is ionic metal counterions: an isolated group 1 / group 2 metal atom (Na, K, Li, Ca, Mg, …) is recovered directly as its ion ([Na+], [Ca+2]) using the group oxidation state, bypassing RDKit. Such atoms are split out of any short ionic contact first, so a salt like Na⁺·methyl-phosphate⁻ recovers the metal as [Na+] and perceives the anion separately, and (when you pass total_charge) the charge is balanced across the ion and the anion. This is on by default; disable with recover_ionic_metals=False (CLI: --no-ionic-metals). Transition metals and multi-atom metal fragments are still reported unsupported.

Install

python -m pip install rdkit
python -m pip install -e .

Conda also works well for RDKit:

conda install -c conda-forge rdkit
python -m pip install -e .

The package installs a xyzrecover console script; you can also run it as a module with python -m xyzrecover.

Charge options apply to every block. --total-charge and --per-fragment-charges are applied to every XYZ block in every input file. Use one charge state per invocation for heterogeneous batches.

CLI examples

Unknown charge, try charge states -3 through +3:

xyzrecover molecule.xyz --json results.json --csv results.csv --sdf results.sdf

Known total charge for each XYZ block:

xyzrecover acetate.xyz --total-charge -1

A salt or cluster with multiple disconnected components and known total charge:

xyzrecover ion_pair.xyz --total-charge 0 --candidate-charges=-2,-1,0,1,2

Known per-component charges. The order follows connectivity perception (the same order as molecule_index in the output, not the input atom order), so run once unconstrained first to see how the structure splits:

xyzrecover fragments.xyz --per-fragment-charges=1,-1,0

If an asserted charge can't be satisfied for a component, fall back to the candidate charge search for that component instead of failing it:

xyzrecover molecule.xyz --total-charge 2 --charge-fallback

Use extended Hueckel connectivity if your RDKit build supports YAeHMOP:

xyzrecover molecule.xyz --use-hueckel --total-charge 1

Python API

from xyzrecover import PerceptionConfig, recover_xyz_file, records_to_json

config = PerceptionConfig(
    total_charge=-1,                  # optional but recommended when known
    candidate_charges=(-2, -1, 0, 1), # used when fragment charge is unknown
    split_fragments=True,
)

records = recover_xyz_file("input.xyz", config)
for rec in records:
    print(rec.molecule_index, rec.formula, rec.charge, rec.smiles, rec.inchikey, rec.confidence)

print(records_to_json(records))

Each record contains:

  • status: ok, failed (RDKit could not assign bonds), or unsupported (component contains an element RDKit cannot handle)
  • charge: RDKit formal charge for the selected assignment
  • smiles: canonical SMILES, with hydrogens made implicit when possible
  • explicit_h_smiles: canonical SMILES retaining explicit hydrogens from the XYZ
  • inchi and inchikey
  • confidence: high, medium, low, or none
  • block_index / molecule_index: which XYZ block (frame) and which recovered component the record came from
  • atom_indices: indices into the source block's atoms for this component
  • alternates: other successful charge/bond-order assignments, when present
  • warnings / errors: notes and failed charge attempts, useful for debugging

Multi-frame XYZ

Each XYZ block is processed independently, so a multi-frame file (e.g. a trajectory of one molecule) yields one set of records per frame, distinguished by block_index. De-duplicate on inchikey downstream if you only want unique species.

Recommended workflow

  1. Run with known total charge whenever available.
  2. Inspect any record with confidence != "high" or non-empty alternates.
  3. For unusual bonding, radicals, metals, transition states, hypervalent species, or very close non-covalent contacts, validate against quantum chemistry output or another bond-perception method.
  4. Keep the SDF output because it preserves the recovered 3D coordinates and the inferred bond graph.

What to expect (benchmarks)

Recovery rate and confidence depend strongly on the chemistry. Measured across several datasets (COMP6, ANI-1x/ANI-1xm, GDB, peptides, H-bonded and ion-pair complexes):

Regime Perceived Typical confidence How to handle
Neutral organics (COMP6, ANI-1xm, GDB, peptides) 98–100% high Auto-accept. Averaging over multiple conformers recovers strained cases (→99.8%).
Neutral H-bonded complexes ~100% (matches OpenBabel) high Use split_fragments=True (default).
Proton-transfer / ion-pair complexes recovered medium Genuinely ambiguous (where is the proton?); ≥ OpenBabel. Gate on confidence and inspect alternates.
Charged organic ions, charge unknown ~86% lower Declare the charge (--total-charge) to push to high; otherwise gate.
Group 1/2 metal counterions (Na, K, Li, Ca, Mg) lone ions recovered high Recovered as [Na+]/[Ca+2] via the ionic fast-path; pass total_charge to balance the anion.
Transition metals / multi-atom metal fragments not supported none Reported as status="unsupported", smiles=None. Use a metal-aware tool (e.g. xyz2mol_tm, xtb WBO).

Practical gating rule: auto-accept status="ok" with confidence="high" and empty alternates; route everything else (medium/low, non-empty alternates, failed, unsupported) to review or a second method.

Development

Install the dev extras (pytest, ruff, mypy) and run the checks:

python -m pip install -e ".[dev]"
python -m pytest        # tests
ruff check .            # lint
ruff format .           # format

CI runs ruff (lint + format check) and pytest on Python 3.10–3.12.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xyzrecover-0.1.0.tar.gz (26.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

xyzrecover-0.1.0-py3-none-any.whl (19.1 kB view details)

Uploaded Python 3

File details

Details for the file xyzrecover-0.1.0.tar.gz.

File metadata

  • Download URL: xyzrecover-0.1.0.tar.gz
  • Upload date:
  • Size: 26.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for xyzrecover-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7332a357e62d5c4879af95146acd7799eb228373c8569b9de1540de73a47cc98
MD5 feadbd0211f5c02bef37a7d09b11f62a
BLAKE2b-256 16513ec5f226dd1648e31e67a652143b2d650842e06d02e48b4f6241e098179b

See more details on using hashes here.

Provenance

The following attestation bundles were made for xyzrecover-0.1.0.tar.gz:

Publisher: release.yml on piskuliche/xyzrecover

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file xyzrecover-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: xyzrecover-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 19.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for xyzrecover-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fdffba911cf85f7c3db62e2bab3d33333bf4e744ad483f3e21dc659ae81938c7
MD5 13f970b7505f58c0bb6b0bfd839fb869
BLAKE2b-256 7d5c7dca5358b61dbe4c4e60a0978289cfce95fd5b8b6d07603ba5d1ad6d86cb

See more details on using hashes here.

Provenance

The following attestation bundles were made for xyzrecover-0.1.0-py3-none-any.whl:

Publisher: release.yml on piskuliche/xyzrecover

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page