
Another molecular string representation


AMSR

Another Molecular String Representation, inspired by

Demo

amsr.harrystern.org

Installing

uv pip install .
uv pip install ".[gpu]" # for GPU support

Usage

import amsr

amsr.ToMol("CNcncc5cNcN6C.oC.o") # caffeine


taxol_smi = "CC1=C2[C@@]([C@]([C@H]([C@@H]3[C@]4([C@H](OC4)C[C@@H]([C@]3(C(=O)[C@@H]2OC(=O)C)C)O)OC(=O)C)OC(=O)c5ccccc5)(C[C@@H]1OC(=O)[C@H](O)[C@@H](NC(=O)c6ccccc6)c7ccccc7)O)(C)C"

amsr.FromSmiles(taxol_smi)
# CccCC`C`C`C'C`OC4.CC'C'6coC`8[OAc].C.O....[OAc].O[Bz].CC`6OcoC`O.C'N[Bz].[Ph]....O.C.C

amsr.FromSmiles(taxol_smi, useGroups=False)
# CccCC`C`C`C'C`OC4.CC'C'6coC`8OcoC..C.O....OcoC..Ococccccc6......CC`6OcoC`O.C'Ncocccccc6......cccccc6.........O.C.C

Description

A molecular string representation in which every sequence of tokens generates a "reasonable" molecule. Your idea of what constitutes a reasonable molecule may differ from the one built into this representation.

Atoms

Atoms are represented by their symbol enclosed in square brackets, as in SMILES. For a one-letter symbol, brackets may be omitted. Atoms are assumed to have a fixed valence that limits the number of covalently-bonded neighbors. If an atom makes fewer bonds than its valence, hydrogens are assumed.

AMSR molecule
C methane
O water
[Cl] hydrogen chloride
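The implicit-hydrogen rule above can be sketched in a few lines. This is only an illustration: the `VALENCE` table and the `implicit_hydrogens` helper are hypothetical names, not part of the amsr package.

```python
# Hypothetical valence table for illustration; not taken from the amsr source.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def implicit_hydrogens(symbol: str, bonds_made: int = 0) -> int:
    """Hydrogens assumed on an atom that makes fewer bonds than its valence."""
    return max(VALENCE[symbol] - bonds_made, 0)


print(implicit_hydrogens("C"))   # 4, as in methane
print(implicit_hydrogens("O"))   # 2, as in water
print(implicit_hydrogens("Cl"))  # 1
```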

Chains

Each atom in a chain is bonded to the most recently added atom that can still make bonds, according to its valence. Hydrogens may be added explicitly like any other atom. In the example below, the fluorines are added to the second carbon; the chlorine is then added to the first carbon, since the second can no longer bond.

AMSR molecule
CCFFF[Cl] 2-chloro-1,1,1-trifluoroethane
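A minimal sketch of the chain rule, assuming a simplified grammar with only uppercase atom tokens. The `build_chain` function and the `VALENCE` table are hypothetical and written for this example; they are not the amsr implementation.

```python
import re

# Illustrative valences only; not taken from the amsr source.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "H": 1}


def build_chain(amsr_string: str):
    """Return (symbols, bonds) for a string that contains only atom tokens."""
    matches = re.findall(r"\[([A-Za-z]+)\]|([A-Z])", amsr_string)
    symbols = [bracketed or bare for bracketed, bare in matches]
    free = []   # bonds each atom can still make
    bonds = []  # (earlier atom index, later atom index)
    for i, symbol in enumerate(symbols):
        # Search backwards for the most recently added atom that can still bond.
        partner = next((j for j in range(i - 1, -1, -1) if free[j] > 0), None)
        free.append(VALENCE[symbol])
        if partner is not None:
            bonds.append((partner, i))
            free[partner] -= 1
            free[i] -= 1
    return symbols, bonds


symbols, bonds = build_chain("CCFFF[Cl]")
print(bonds)  # [(0, 1), (1, 2), (1, 3), (1, 4), (0, 5)]
```

The three fluorines land on atom 1 (the second carbon), and the chlorine falls back to atom 0 once atom 1 is full, matching the description above.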

Branches

Branches are formed automatically when atoms can no longer make bonds. They can also be made by "capping" or "saturating" an atom with hydrogens, using a period . (capping hydrogens are applied to the most recently-added atom that can still make bonds). New atoms will then be bonded to those added earlier, forming a branch.

AMSR molecule
CCC.C isobutane
CC.CC.C.C 2,2-dimethylbutane
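The capping rule can be sketched by extending the same idea with a `.` token. This sketch assumes that a period fills all remaining valence of the most recently added atom that can still bond, which is consistent with the two examples above. `build_branched` and `VALENCE` are hypothetical names, not amsr API.

```python
import re

# Illustrative valences only; not taken from the amsr source.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def build_branched(amsr_string: str):
    """Return (symbols, bonds) for a string of uppercase atoms and periods."""
    tokens = re.findall(r"\[[A-Za-z]+\]|[A-Z]|\.", amsr_string)
    symbols, free, bonds = [], [], []
    for token in tokens:
        # Most recently added atom that can still make bonds, if any.
        recent = next(
            (j for j in range(len(symbols) - 1, -1, -1) if free[j] > 0), None
        )
        if token == ".":
            if recent is not None:
                free[recent] = 0  # remaining valence is filled with hydrogens
            continue
        symbol = token.strip("[]")
        symbols.append(symbol)
        free.append(VALENCE[symbol])
        if recent is not None:
            bonds.append((recent, len(symbols) - 1))
            free[recent] -= 1
            free[-1] -= 1
    return symbols, bonds


print(build_branched("CCC.C")[1])      # [(0, 1), (1, 2), (1, 3)]
print(build_branched("CC.CC.C.C")[1])  # [(0, 1), (0, 2), (2, 3), (2, 4), (2, 5)]
```

In the first string, atom 1 ends up with three carbon neighbors (isobutane). In the second, atom 2 becomes the quaternary carbon of 2,2-dimethylbutane.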

Rings

Rings are denoted by a single digit (or two or more digits enclosed in square brackets) giving the size of the ring. A new bond is formed between the two most recently added atoms that can still make bonds and that, once bonded, close a ring of that size.

AMSR molecule
CCO3 oxirane
CCCCCC6 cyclohexane
CCCCCCCCCCCC[12] cyclododecane
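One way to picture ring closure is as a search for a pair of atoms with free valence that lie `size - 1` bonds apart. The sketch below makes two assumptions that are not confirmed by the description: pairs are visited from the most recent atom backwards, and the shortest path is used as the ring path. `distance` and `close_ring` are hypothetical helpers, not amsr functions.

```python
from collections import deque


def distance(adj, start, goal):
    """Shortest path length in bonds between two atoms (BFS), or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None


def close_ring(adj, free, size):
    """Bond the most recent pair of atoms with free valence that closes a ring."""
    for i in range(len(adj) - 1, -1, -1):
        for j in range(i - 1, -1, -1):
            if free[i] > 0 and free[j] > 0 and distance(adj, i, j) == size - 1:
                adj[i].append(j)
                adj[j].append(i)
                free[i] -= 1
                free[j] -= 1
                return (j, i)
    return None


# Oxirane, CCO3: chain C0-C1-O2, then a three-membered ring closure.
adj = [[1], [0, 2], [1]]
free = [3, 2, 1]  # bonds each atom can still make
print(close_ring(adj, free, 3))  # (0, 2)
```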

Double bonds (sp2 centers)

Atoms making a double bond are indicated by changing the symbol to lowercase (note that lowercase does not mean "aromatic"; merely, "atom having one fewer neighbor than its valence.") Double bonds are assigned by a matching algorithm. If a perfect matching cannot be found (for instance, in the case of an odd number of contiguous lowercase symbols) a maximal matching is chosen, non-matched atoms remain singly bonded, and hydrogens are added.

AMSR molecule
co formaldehyde
cccccc6 benzene
cco acetaldehyde (only one double bond added)
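The matching step can be illustrated with a small exhaustive search over the bonds between lowercase atoms. This is a sketch for tiny inputs only; the description does not say which matching algorithm amsr uses or how ties are broken, and `max_matching` is a hypothetical helper.

```python
def max_matching(edges):
    """Largest set of pairwise-disjoint edges, by exhaustive search."""
    if not edges:
        return []
    (a, b), rest = edges[0], edges[1:]
    # Option 1: leave the first edge single.
    best = max_matching(rest)
    # Option 2: make it double and drop every edge that shares an atom with it.
    taken = [(a, b)] + max_matching([e for e in rest if a not in e and b not in e])
    return taken if len(taken) > len(best) else best


# Benzene, cccccc6: six lowercase atoms in a ring, so a perfect matching exists.
benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]
print(len(max_matching(benzene)))  # 3

# cco: three contiguous lowercase atoms, so only one double bond fits.
print(len(max_matching([(0, 1), (1, 2)])))  # 1
```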

Note that an oxygen with two neighbors or a nitrogen with three in an aromatic ring is still denoted by a capital (not a lowercase) symbol, although sp2-hybridized, since its coordination number is still equal to its valence.

AMSR molecule
ccccO5 furan
ccccN5 pyrrole

Ring selection

When more than one ring of a given size can be formed, one or more @ signs immediately after the digit will make ring-forming bonds with atoms appearing earlier in the string, rather than the most recent.

AMSR molecule
ccOcc5cccc6 benzofuran
ccOcc5cccc6@ isobenzofuran
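The effect of `@` can be sketched as choosing a later entry from a list of candidate ring closures. The sketch assumes candidates are ordered from the most recent atoms backwards and that each `@` skips one candidate; that reading reproduces the two examples above but is not confirmed by the description. `ring_candidates` is a hypothetical helper, and the graph is hard-coded to the state after `ccOcc5cccc`.

```python
from collections import deque


def distance(adj, start, goal):
    """Shortest path length in bonds between two atoms (BFS), or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None


def ring_candidates(adj, free, size):
    """All atom pairs that could close a ring of this size, most recent first."""
    n = len(adj)
    return [
        (j, i)
        for i in range(n - 1, -1, -1)
        for j in range(i - 1, -1, -1)
        if free[i] > 0 and free[j] > 0 and distance(adj, i, j) == size - 1
    ]


# State after "ccOcc5cccc": a furan ring (atoms 0-4) with a four-atom tail (5-8).
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (4, 5), (5, 6), (6, 7), (7, 8)]
adj = [[] for _ in range(9)]
for a, b in bonds:
    adj[a].append(b)
    adj[b].append(a)
free = [1, 1, 0, 1, 0, 1, 1, 1, 2]  # neighbor slots left on each atom

candidates = ring_candidates(adj, free, 6)
print(candidates[0])  # (3, 8): no @, fuses next to the oxygen (benzofuran)
print(candidates[1])  # (0, 8): one @, fuses away from the oxygen (isobenzofuran)
```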

Triple bonds (sp centers)

Atoms with two fewer neighbors than their valence are designated by a trailing colon : and can make triple bonds (or more than one double bond).

AMSR molecule
C:N: hydrogen cyanide
oC:o carbon dioxide

Hypervalent atoms

Atoms denoted by their symbol alone are assumed to have their lowest possible valence (for instance, two for sulfur). Higher valences are denoted by one or more exclamation points !.

AMSR molecule
CSC dimethyl sulfide
Cs!oC dimethyl sulfoxide
S!!FFFFFF sulfur hexafluoride
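The modifiers introduced so far (lowercase, `:` and `!`) all change how many neighbors an atom may have. A rough sketch of that bookkeeping follows, assuming each `!` steps to the next allowed valence and covering one-letter symbols only. The `ALLOWED_VALENCES` table and `max_neighbors` function are hypothetical, not part of amsr.

```python
import re

# Illustrative table: allowed valences per element, lowest first.
ALLOWED_VALENCES = {"C": [4], "N": [3], "O": [2], "S": [2, 4, 6]}


def max_neighbors(token: str) -> int:
    """Number of neighbors (including hydrogens) implied by an atom token."""
    match = re.fullmatch(r"([A-Za-z])(!*)(:?)", token)
    if match is None:
        raise ValueError(f"unsupported token in this sketch: {token!r}")
    letter, bangs, colon = match.groups()
    valence = ALLOWED_VALENCES[letter.upper()][len(bangs)]
    if colon:
        return valence - 2  # sp center: two fewer neighbors than valence
    if letter.islower():
        return valence - 1  # sp2 center: one fewer neighbor than valence
    return valence


print(max_neighbors("S"))    # 2, dimethyl sulfide
print(max_neighbors("s!"))   # 3, dimethyl sulfoxide
print(max_neighbors("S!!"))  # 6, sulfur hexafluoride
print(max_neighbors("C:"))   # 2, hydrogen cyanide carbon
```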

Formal charges, radical electrons, isotopes

Positive/negative formal charges are designated by one or more of +/-. Radical electrons are denoted by one or more asterisks *. An isotopic mass is denoted by a number prefix before the atomic symbol (in which case square brackets must be used even for a one-letter symbol).

AMSR molecule
[Mg++];S!!:ooO-O- magnesium sulfate
CC.C.CCCCC.C.N6O* (2,2,6,6-Tetramethylpiperidin-1-yl)oxyl or TEMPO
[2H]O[2H] heavy water

Tetrahedral stereochemistry

Tetrahedral stereochemistry is denoted by a single quote ' meaning "clockwise" or a backtick ` meaning "counterclockwise," referring to the first three neighbors of a stereocenter atom as they appear in the string, with the last neighbor (or implicit hydrogen) in back.

AMSR molecule
C`C.FO (1S)-1-fluoroethanol
C'C.FO (1R)-1-fluoroethanol

E/Z stereochemistry

Stereochemistry for a double bond is denoted by an underscore _ meaning "trans" or E, or caret ^ meaning "cis" or Z, between the two atoms making the bond, where the reference neighboring atoms are those that appear earliest in the string.

AMSR molecule
c[Br][Cl]_c[Cl] (E)-1-bromo-1,2-dichloroethene
c[Br][Cl]^c[Cl] (Z)-1-bromo-1,2-dichloroethene

Groups

The following abbreviations may be used to represent various functional groups:

(5aN), (5aNbN), (5aNbO), (5aNbS), (5aNcN), (5aNcO), (5aNcS), (5aNdN), (5aNdO), (5aNdS), (5aNeN), (5aNeO), (5aNeS), (5aO), (5aS), (5bN), (5bNaN), (5bNaO), (5bNaS), (5bNcN), (5bNcO), (5bNcS), (5bNdN), (5bNdO), (5bNdS), (5bNeN), (5bNeO), (5bNeS), (5bO), (5bS), (5cN), (5cNaN), (5cNaO), (5cNaS), (5cNbN), (5cNbO), (5cNbS), (5cNdN), (5cNdO), (5cNdS), (5cNeN), (5cNeO), (5cNeS), (5cO), (5cS), (5dN), (5dNaN), (5dNaO), (5dNaS), (5dNbN), (5dNbO), (5dNbS), (5dNcN), (5dNcO), (5dNcS), (5dNeN), (5dNeO), (5dNeS), (5dO), (5dS), (5eN), (5eNaN), (5eNaO), (5eNaS), (5eNbN), (5eNbO), (5eNbS), (5eNcN), (5eNcO), (5eNcS), (5eNdN), (5eNdO), (5eNdS), (5eO), (5eS), (6), (6aN), (6abN), (6abcN), (6abdN), (6abeN), (6abfN), (6acN), (6acdN), (6aceN), (6acfN), (6adN), (6adeN), (6adfN), (6aeN), (6aefN), (6afN), (6bN), (6bcN), (6bcdN), (6bceN), (6bcfN), (6bdN), (6bdeN), (6bdfN), (6beN), (6befN), (6bfN), (6cN), (6cdN), (6cdeN), (6cdfN), (6ceN), (6cefN), (6cfN), (6dN), (6deN), (6defN), (6dfN), (6eN), (6efN), (6fN), [Ac], [Bn], [Boc], [Bz], [CCl3], [CF3], [CHO], [CN], [COO-], [COOEt], [COOH], [COOMe], [Cbz], [Cy], [Et], [Ms], [NC], [NHAc], [NHMe], [NMe2], [NO2], [OAc], [OEt], [OMe], [OiBu], [PO3], [Ph], [Piv], [SMe], [SO3], [Tf], [Tol], [Ts], [iBu], [iPr], [nBu], [nDec], [nHept], [nHex], [nNon], [nOct], [nPent], [nPr], [sBu], [tBu]
AMSR molecule
N+C'[Bn][COO-] L-phenylalanine
cccccc6[iBu]..CC.[COOH] ibuprofen
C[Ph][Et]coNcoNco6 phenobarbital
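Comparing the two taxol strings in the Usage section suggests that a group abbreviation stands for a fixed, already-capped token sequence. The sketch below expands three groups by plain string substitution. The expansions are inferred from that one example and `expand_groups` is a hypothetical helper; the package may handle groups differently.

```python
# Hypothetical expansion table, inferred from the taxol example in Usage.
GROUP_EXPANSIONS = {
    "[OAc]": "OcoC.",
    "[Bz]": "cocccccc6.....",
    "[Ph]": "cccccc6.....",
}


def expand_groups(amsr_string: str) -> str:
    """Replace each known group abbreviation with its token sequence."""
    for abbreviation, expansion in GROUP_EXPANSIONS.items():
        amsr_string = amsr_string.replace(abbreviation, expansion)
    return amsr_string


# The two taxol strings from the Usage section.
grouped = "CccCC`C`C`C'C`OC4.CC'C'6coC`8[OAc].C.O....[OAc].O[Bz].CC`6OcoC`O.C'N[Bz].[Ph]....O.C.C"
ungrouped = "CccCC`C`C`C'C`OC4.CC'C'6coC`8OcoC..C.O....OcoC..Ococccccc6......CC`6OcoC`O.C'Ncocccccc6......cccccc6.........O.C.C"

print(expand_groups(grouped) == ungrouped)
```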

Multiple molecules

More than one molecule may be specified by separating with ;.

AMSR molecule
ocC.O(6)[COOH];CcoN(6)..O aspirin and acetaminophen

Developing

uv pip install ".[dev]"
pre-commit install
