AMSR
Another Molecular String Representation, inspired by
- H. Hiz, "A Linearization of Chemical Graphs," J. Chem. Doc. 4, 173-180 (1964)
- SMILES
- PATTY
- DeepSMILES
- SELFIES
Demo
Installing
```shell
uv pip install .
uv pip install ".[gpu]" # for GPU support
```
Usage
```python
import amsr
amsr.ToMol("CNcncc5cNcN6C.oC.o") # caffeine
taxol_smi = "CC1=C2[C@@]([C@]([C@H]([C@@H]3[C@]4([C@H](OC4)C[C@@H]([C@]3(C(=O)[C@@H]2OC(=O)C)C)O)OC(=O)C)OC(=O)c5ccccc5)(C[C@@H]1OC(=O)[C@H](O)[C@@H](NC(=O)c6ccccc6)c7ccccc7)O)(C)C"
amsr.FromSmiles(taxol_smi)
# CccCC`C`C`C'C`OC4.CC'C'6coC`8[OAc].C.O....[OAc].O[Bz].CC`6OcoC`O.C'N[Bz].[Ph]....O.C.C
amsr.FromSmiles(taxol_smi, useGroups=False)
# CccCC`C`C`C'C`OC4.CC'C'6coC`8OcoC..C.O....OcoC..Ococccccc6......CC`6OcoC`O.C'Ncocccccc6......cccccc6.........O.C.C
```
Description
A molecular string representation in which every sequence of tokens generates a "reasonable" molecule. Your idea of a reasonable molecule may differ from the one this string representation encodes.
Atoms
Atoms are represented by their symbol enclosed in square brackets, as in SMILES. For a one-letter symbol, brackets may be omitted. Atoms are assumed to have a fixed valence that limits the number of covalently-bonded neighbors. If an atom makes fewer bonds than its valence, hydrogens are assumed.
| AMSR | molecule |
|---|---|
| C | methane |
| O | water |
| [Cl] | hydrogen chloride |
Chains
Each atom in a chain is bonded to the most recently added atom that can still make bonds, according to its valence. Hydrogens may be added explicitly like any other atom. In the example below, the fluorines are added to the second carbon; the chlorine is then added to the first carbon, since the second can no longer bond.
| AMSR | molecule |
|---|---|
| CCFFF[Cl] | 2-chloro-1,1,1-trifluoroethane |
Branches
Branches are formed automatically when atoms can no longer
make bonds. They can also be made by "capping" or
"saturating" an atom with hydrogens, using a period .
(capping hydrogens are applied to the most
recently-added atom that can still make bonds).
New atoms will then be bonded to those added earlier, forming a branch.
| AMSR | molecule |
|---|---|
| CCC.C | isobutane |
| CC.CC.C.C | 2,2-dimethylbutane |
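The chain and branch rules above can be illustrated with a minimal toy decoder. This is only a sketch, not the package's implementation: it handles single-bonded, acyclic strings, and the names `VALENCE`, `TOKEN`, and `decode_chain` as well as the small valence table are assumptions made for illustration. Characters it does not recognize are silently skipped.

```python
import re

# Assumed fixed valences for this sketch; the real package covers far more.
VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1}

# A token is a bracketed symbol, a one-letter symbol, or a capping period.
TOKEN = re.compile(r"\[([A-Z][a-z]?)\]|([A-Z])|(\.)")


def decode_chain(amsr_string):
    """Return (symbols, bonds, hydrogens) for an acyclic, single-bonded string."""
    symbols, bonds, free = [], [], []

    def latest_open():
        # Index of the most recently added atom that can still make bonds.
        for i in reversed(range(len(symbols))):
            if free[i] > 0:
                return i
        return None

    for match in TOKEN.finditer(amsr_string):
        bracketed, single, cap = match.groups()
        if cap:
            i = latest_open()
            if i is not None:
                free[i] = 0  # saturate this atom with hydrogens
            continue
        symbol = bracketed or single
        partner = latest_open()  # chosen before the new atom is added
        symbols.append(symbol)
        free.append(VALENCE[symbol])
        new = len(symbols) - 1
        if partner is not None:
            bonds.append((partner, new))
            free[partner] -= 1
            free[new] -= 1

    # Implicit hydrogens fill whatever valence the explicit bonds leave over.
    degree = [0] * len(symbols)
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    hydrogens = [VALENCE[s] - d for s, d in zip(symbols, degree)]
    return symbols, bonds, hydrogens


print(decode_chain("CCC.C")[1])      # [(0, 1), (1, 2), (1, 3)]  isobutane
print(decode_chain("CCFFF[Cl]")[1])  # [(0, 1), (1, 2), (1, 3), (1, 4), (0, 5)]
```

In the second example the chlorine (atom 5) bonds to the first carbon (atom 0) because the second carbon is already saturated by three fluorines, as described in the Chains section.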
Rings
Rings are denoted by a single digit (or two or more digits enclosed in square brackets) giving the size of the ring. A new bond is formed between the two most recently-added atoms that can still make bonds and that, once bonded, close a ring of that size.
| AMSR | molecule |
|---|---|
| CCO3 | oxirane |
| CCCCCC6 | cyclohexane |
| CCCCCCCCCCCC[12] | cyclododecane |
Double bonds (sp2 centers)
Atoms making a double bond are indicated by changing the symbol to lowercase (note that lowercase does not mean "aromatic"; merely, "atom having one fewer neighbor than its valence.") Double bonds are assigned by a matching algorithm. If a perfect matching cannot be found (for instance, in the case of an odd number of contiguous lowercase symbols) a maximal matching is chosen, non-matched atoms remain singly bonded, and hydrogens are added.
| AMSR | molecule |
|---|---|
| co | formaldehyde |
| cccccc6 | benzene |
| cco | acetaldehyde (only one double bond added) |
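As a rough illustration of assigning double bonds by matching, the sketch below brute-forces a largest set of non-overlapping bonds among lowercase atoms. It is not the package's algorithm: the function name `max_matching` is hypothetical, the search is exponential and only suitable for tiny graphs, and the tie-breaking (preferring later bonds when two matchings are equally large) is an assumption chosen because it reproduces the acetaldehyde row in the table.

```python
def max_matching(edges):
    """Return a largest set of pairwise disjoint edges (brute force)."""
    if not edges:
        return []
    (a, b), rest = edges[0], edges[1:]
    # Option 1: leave the first bond single.
    skip = max_matching(rest)
    # Option 2: make it double and drop every bond sharing an atom with it.
    take = [(a, b)] + max_matching([e for e in rest if a not in e and b not in e])
    # On a tie, keep the matching that uses later bonds (assumed tie-break).
    return take if len(take) > len(skip) else skip


# Bonds between adjacent lowercase atoms, as (earlier atom, later atom).
formaldehyde = [(0, 1)]                                     # co
benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]  # cccccc6
cco = [(0, 1), (1, 2)]                                      # cco

print(max_matching(formaldehyde))  # [(0, 1)]
print(len(max_matching(benzene)))  # 3 double bonds, a perfect matching
print(max_matching(cco))           # [(1, 2)]: only the C=O bond is doubled
```

In the `cco` case the first carbon is left unmatched, so it stays singly bonded and is filled with hydrogens.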
Note that an oxygen with two neighbors or a nitrogen with three in an aromatic ring is still denoted by a capital (not a lowercase) symbol, although sp2-hybridized, since its coordination number is still equal to its valence.
| AMSR | molecule |
|---|---|
| ccccO5 | furan |
| ccccN5 | pyrrole |
Ring selection
When more than one ring of a given size can be formed, one or more @ signs immediately after
the digit select atoms appearing earlier in the string for the ring-forming bond,
rather than the most recent ones.
| AMSR | molecule |
|---|---|
| ccOcc5cccc6 | benzofuran |
| ccOcc5cccc6@ | isobenzofuran |
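The ring rule and the @ modifier can be sketched as a search over atom pairs. The code below is an illustration under stated assumptions, not the package's implementation: the helper names (`adjacency_from`, `ring_distance`, `ring_closure`) are hypothetical, ring size is measured along the shortest path, candidates are ordered by the later atom first and then the earlier atom, and each @ is taken to skip one candidate. With these assumptions the sketch reproduces the two rows of the table above.

```python
from collections import deque


def adjacency_from(bonds, n_atoms):
    """Build a neighbor list from (atom, atom) index pairs."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def ring_distance(adjacency, start, goal):
    """Number of bonds on the shortest path between two atoms, or None."""
    seen, queue = {start: 0}, deque([start])
    while queue:
        atom = queue.popleft()
        if atom == goal:
            return seen[atom]
        for nbr in adjacency[atom]:
            if nbr not in seen:
                seen[nbr] = seen[atom] + 1
                queue.append(nbr)
    return None


def ring_closure(adjacency, free, size, skip=0):
    """Pick the atom pair to bond for a ring token; skip = number of @ signs."""
    candidates = [
        (i, j)
        for j in range(len(free))
        for i in range(j)
        if free[i] > 0 and free[j] > 0
        and ring_distance(adjacency, i, j) == size - 1
    ]
    # Most recently added atoms first (assumed ordering).
    candidates.sort(key=lambda pair: (pair[1], pair[0]), reverse=True)
    return candidates[skip] if skip < len(candidates) else None


# State just before the final "6" in ccOcc5cccc6: a furan ring (atoms 0-4)
# with a four-carbon chain (atoms 5-8) attached to atom 4.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (4, 5), (5, 6), (6, 7), (7, 8)]
adjacency = adjacency_from(bonds, 9)
free = [1, 1, 0, 1, 0, 1, 1, 1, 2]  # neighbors each atom can still accept

print(ring_closure(adjacency, free, 6))          # (3, 8): benzofuran
print(ring_closure(adjacency, free, 6, skip=1))  # (0, 8): isobenzofuran
```

Atom 3 sits next to the oxygen, so closing the ring onto it fuses the benzene ring at the oxygen-bearing carbon (benzofuran). Skipping to atom 0 fuses it at the two carbons not adjacent to oxygen (isobenzofuran).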
Triple bonds (sp centers)
Atoms with two fewer neighbors than their valence are designated by a trailing colon :
and can make triple bonds (or more than one double bond).
| AMSR | molecule |
|---|---|
| C:N: | hydrogen cyanide |
| oC:o | carbon dioxide |
Hypervalent atoms
Atoms denoted by their symbol alone are assumed to have their lowest possible valence
(for instance, two for sulfur). Higher valences are denoted by one or more exclamation points !.
| AMSR | molecule |
|---|---|
| CSC | dimethyl sulfide |
| Cs!oC | dimethyl sulfoxide |
| S!!FFFFFF | sulfur hexafluoride |
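How the lowercase, colon, and exclamation-point modifiers combine can be summarized as simple arithmetic on the atom's valence. The sketch below is an assumption-laden illustration, not the package's API: `VALENCES` and `max_neighbors` are hypothetical names, the valence ladders list only a few elements, and each `!` is assumed to move one step up the ladder (consistent with the sulfur examples in this document).

```python
# Assumed valence ladders, lowest first; a few elements for illustration only.
VALENCES = {"C": (4,), "N": (3,), "O": (2,), "P": (3, 5), "S": (2, 4, 6)}


def max_neighbors(symbol, bangs=0, unsaturation=0):
    """Largest number of bonded neighbors (hydrogens included) a token allows.

    bangs: number of '!' after the symbol, each one step up the ladder.
    unsaturation: 0 for a plain symbol, 1 for lowercase, 2 for a trailing ':'.
    """
    valence = VALENCES[symbol][bangs]
    return valence - unsaturation


print(max_neighbors("S"))                     # 2: S in CSC (dimethyl sulfide)
print(max_neighbors("S", bangs=1, unsaturation=1))  # 3: s! in Cs!oC
print(max_neighbors("S", bangs=2))            # 6: S!! in S!!FFFFFF
print(max_neighbors("S", bangs=2, unsaturation=2))  # 4: S!!: in sulfate
print(max_neighbors("C", unsaturation=2))     # 2: C: in C:N:
```

For example, `s!` in dimethyl sulfoxide has valence 4 and one double bond, leaving room for exactly three neighbors: two carbons and one oxygen.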
Formal charges, radical electrons, isotopes
Positive/negative formal charges are designated by one or more of +/-.
Radical electrons are denoted by one or more asterisks *. An isotopic mass is denoted by a number prefix
before the atomic symbol (in which case square brackets must be used even for a one-letter symbol).
| AMSR | molecule |
|---|---|
| [Mg++];S!!:ooO-O- | magnesium sulfate |
| CC.C.CCCCC.C.N6O* | (2,2,6,6-Tetramethylpiperidin-1-yl)oxyl or TEMPO |
| [2H]O[2H] | heavy water |
Tetrahedral stereochemistry
Tetrahedral stereochemistry is denoted by a single quote ' meaning "clockwise"
or a backtick ` meaning "counterclockwise," referring to the first three neighbors of
a stereocenter atom as they appear in the string, with the last neighbor (or implicit hydrogen)
in back.
| AMSR | molecule |
|---|---|
| C`C.FO | (1S)-1-fluoroethanol |
| C'C.FO | (1R)-1-fluoroethanol |
E/Z stereochemistry
Stereochemistry for a double bond is denoted by an underscore _ meaning "trans" or E,
or caret ^ meaning "cis" or Z, between the two atoms making the bond,
where the reference neighboring atoms are those that appear earliest in the string.
| AMSR | molecule |
|---|---|
| c[Br][Cl]_c[Cl] | (E)-1-bromo-1,2-dichloroethene |
| c[Br][Cl]^c[Cl] | (Z)-1-bromo-1,2-dichloroethene |
Groups
The following abbreviations may be used to represent various functional groups:
(5aN), (5aNbN), (5aNbO), (5aNbS), (5aNcN), (5aNcO), (5aNcS), (5aNdN), (5aNdO), (5aNdS), (5aNeN), (5aNeO), (5aNeS), (5aO), (5aS), (5bN), (5bNaN), (5bNaO), (5bNaS), (5bNcN), (5bNcO), (5bNcS), (5bNdN), (5bNdO), (5bNdS), (5bNeN), (5bNeO), (5bNeS), (5bO), (5bS), (5cN), (5cNaN), (5cNaO), (5cNaS), (5cNbN), (5cNbO), (5cNbS), (5cNdN), (5cNdO), (5cNdS), (5cNeN), (5cNeO), (5cNeS), (5cO), (5cS), (5dN), (5dNaN), (5dNaO), (5dNaS), (5dNbN), (5dNbO), (5dNbS), (5dNcN), (5dNcO), (5dNcS), (5dNeN), (5dNeO), (5dNeS), (5dO), (5dS), (5eN), (5eNaN), (5eNaO), (5eNaS), (5eNbN), (5eNbO), (5eNbS), (5eNcN), (5eNcO), (5eNcS), (5eNdN), (5eNdO), (5eNdS), (5eO), (5eS), (6), (6aN), (6abN), (6abcN), (6abdN), (6abeN), (6abfN), (6acN), (6acdN), (6aceN), (6acfN), (6adN), (6adeN), (6adfN), (6aeN), (6aefN), (6afN), (6bN), (6bcN), (6bcdN), (6bceN), (6bcfN), (6bdN), (6bdeN), (6bdfN), (6beN), (6befN), (6bfN), (6cN), (6cdN), (6cdeN), (6cdfN), (6ceN), (6cefN), (6cfN), (6dN), (6deN), (6defN), (6dfN), (6eN), (6efN), (6fN), [Ac], [Bn], [Boc], [Bz], [CCl3], [CF3], [CHO], [CN], [COO-], [COOEt], [COOH], [COOMe], [Cbz], [Cy], [Et], [Ms], [NC], [NHAc], [NHMe], [NMe2], [NO2], [OAc], [OEt], [OMe], [OiBu], [PO3], [Ph], [Piv], [SMe], [SO3], [Tf], [Tol], [Ts], [iBu], [iPr], [nBu], [nDec], [nHept], [nHex], [nNon], [nOct], [nPent], [nPr], [sBu], [tBu]
| AMSR | molecule |
|---|---|
| N+C'[Bn][COO-] | L-phenylalanine |
| cccccc6[iBu]..CC.[COOH] | ibuprofen |
| C[Ph][Et]coNcoNco6 | phenobarbital |
Multiple molecules
More than one molecule may be specified by separating with ;.
| AMSR | molecule |
|---|---|
| ocC.O(6)[COOH];CcoN(6)..O | aspirin and acetaminophen |
Developing
```shell
uv pip install ".[dev]"
pre-commit install
```