Skip to main content

No project description provided

Project description

Scott

Scott computes a canonical tree representative of any labelled (edge and node) graph, derivable to a canonical trace (string) or canonical adjacency matrix.

Unlike classical isomorphism tools (nauty, bliss, Traces), Scott natively handles both node labels and edge labels, making it particularly well suited for chemical graphs and any domain where edge colors carry semantic meaning.

Table of contents

This repository aims to be concise and practical. For background on the graph isomorphism problem and algorithm details, see the project page.

When to use Scott

Scott's canonical trace is a reusable, serializable fingerprint for a graph. This makes it the right tool when you need to:

  • Work with edge-colored graphs — bond orders, relation types, weighted edges. Tools like nauty/bliss/Traces ignore edge labels by default; Scott treats them as first-class citizens.
  • Index or deduplicate a collection of graphs — store the trace as a string key in a database, dictionary, or search index. Isomorphic graphs always produce the exact same trace, so a lookup replaces a full isomorphism test at query time.
  • Normalize inputs for machine learning — canonical adjacency matrices put isomorphic graphs in the same form, which stabilizes downstream models.

When Scott is not the right choice: if you only need a one-off pairwise isomorphism test on large, unlabeled graphs, dedicated solvers like nauty or bliss will be faster. Scott's overhead comes from producing a fully reusable canonical object rather than a single yes/no answer.


Getting started

See docs/Installation.md for a complete install guide (PyPI + local).

From Pypi

pip install scott

# optional extras
pip install 'scott[rdkit,nx]'

Available extras:

  • rdkit: parse molecular input (SMILES, Mol files)
  • nx: NetworkX integration
  • legacy: pure-Python implementation (slower, no Rust required)
  • dev: development tools

Notes:

  • The Rust backend is the default and requires a Rust toolchain when building from source (CPython only).
  • A pure-Python fallback is available via SCOTT_BACKEND=legacy.

From Docker

The default image is based on python:latest, builds the Rust extension, and installs all extras (rdkit, dev tools):

# Build the default image (CPython + Rust backend)
docker build -t scott .

# Run an interactive shell
docker run --rm -it scott

Alternative Dockerfiles are available under dockerfiles/.

Jupyter

CPython + Rust backend with a Jupyter notebook server.

docker build -t scott:jupyter -f dockerfiles/jupyter/Dockerfile .
docker run --rm -it -p 8888:8888 scott:jupyter

PyPy (standalone)

PyPy images use the pure-Python legacy backend (SCOTT_BACKEND=legacy) since PyPy cannot build PyO3 extensions.

A PyPy-based image with ipython. Useful for benchmarking the legacy backend on a JIT interpreter.

docker build -t scott:pypy -f dockerfiles/pypy/Dockerfile .
docker run --rm -it scott:pypy
# > ipython

PyPy (Jupyter)

Same as above, with a Jupyter notebook server.

docker build -t scott:pypy-jupyter -f dockerfiles/pypy-jupyter/Dockerfile .
docker run --rm -it -p 8888:8888 scott:pypy-jupyter

From source code

Simply clone the repo in a local repertory

# get the code
git clone https://github.com/theplatypus/scott.git
cd ./scott

# create a virtualenv (uv)
uv venv
source .venv/bin/activate
uv pip install -e .

# build the Rust extension (requires a Rust toolchain)
uv run maturin develop --release

# optional extras
uv pip install -e '.[rdkit]'  # SMILES parsing via RDKit
import scott as st
print(st.__version__)

A pure-Python fallback is available for environments where building the Rust extension is not possible:

SCOTT_BACKEND=legacy python3 script.py

Usage

For extended examples see usage.py and usage_advanced.py.

Build a Graph

Scott graphs carry labels on both nodes and edges. The modality parameter on an edge is its color/weight — this is what distinguishes Scott from tools that only support vertex labels.

import scott as st

# A small molecule-like graph: C connected to O by a double bond,
# and to two H by single bonds.
# Edge modality encodes bond order (1 = single, 2 = double).

graph = st.structs.graph.Graph()

n1 = st.structs.node.Node("1", "C")
n2 = st.structs.node.Node("2", "O")
n3 = st.structs.node.Node("3", "H")
n4 = st.structs.node.Node("4", "H")

e1 = st.structs.edge.Edge("1", n1, n2, modality=2)  # double bond
e2 = st.structs.edge.Edge("2", n1, n3)              # single bond
e3 = st.structs.edge.Edge("3", n1, n4)              # single bond

graph.add_node(n1)
graph.add_nodes([n2, n3, n4])
graph.add_edge(e1)
graph.add_edge(e2)
graph.add_edge(e3)

print(graph)

Edge labels matter: two graphs that are identical except for one bond order will produce different canonical traces, correctly distinguishing them as non-isomorphic.

Import Graphs

Scott is also able to parse a few graph formats files. Note that a parsing function always returns a list, even if there is one molecule in the file.

# Parse a .sdf file (chemical file standard) :

# from file path
compounds = st.parse.from_sdf(file_path='./data/molecule/cafeine.sdf')
cafeine = compounds[0]
print(cafeine)

# ignore hydrogens
compounds = st.parse.from_sdf(file_path='./data/molecule/simple.sdf', ignore_hydrogens=True)
simple_without_H = compounds[0]
print(simple_without_H)

# Parse a SMILES string (requires: pip install scott[rdkit])
smile = st.parse.parse_smiles('CCOCOCC=CCONC')

# we can iterate over graph vertices
for id_node in smile.V :
	print("Node #%s : %s" % (str(id_node), str(smile.V[id_node].label)))

# Parse a .dot file
cfi1 = st.parse.from_dot(file_path='./data/isotest/cfi-rigid-t2-dot/cfi-rigid-t2-0016-04-1.dot')[0]

# Parse a .dimacs file
cfi2 = st.parse.from_dimacs(file_path='./data/isotest/cfi-rigid-t2/cfi-rigid-t2-0016-04-1')[0]

# Parse a PubChem XML file
compounds = st.parse.from_pubchem_xml(file_path='./data/molecule/simple.xml')

# Convert a NetworkX graph 
import networkx as nx
nxg = nx.Graph()
nxg.add_node("a", label="C")
nxg.add_node("b", label="O")
nxg.add_edge("a", "b", weight=2)
g = st.parse.from_networkx(nxg)

G = st.parse.from_networkx(nx.random_geometric_graph(200, 0.125, seed=896803))

Canonical traces

A canonical trace is a string that uniquely identifies an isomorphism class. Two graphs are isomorphic if and only if their traces are equal. Because it is a plain string, it can be stored, compared, hashed, or indexed like any other string — with no graph library required on the reading side.

import scott as st

simple = st.parse.from_pubchem_xml(file_path='./data/molecule/simple.xml')[0]

# to_cgraph returns a CGraph object — str() gives the trace string
simple_cgraph = st.canonize.to_cgraph(simple)
simple_trace  = str(simple_cgraph)

assert simple_trace == '(H:1, H:1, (H:1, H:1, ((((C:1).#2{$1}:2)C:2)C:2, ((((C:1).#2{$1}:2)C:2)C:1).#2{$2}:1)C:1)C:1, (O:2, (((((C:1).#2{$1}:2)C:2)C:1).#2{$2}:1)O:1)C:1)C'

# SMILES (requires rdkit extra)
mol = st.parse.parse_smiles('CCOCOCC=CCONC', ignore_hydrogens=True)
assert str(st.canonize.to_cgraph(mol)) == '(C:1, (((((((((C:1)N:1)O:1)C:1)C:2)C:1)C:1)O:1)C:1)O:1)C'

Because the trace is a plain string, any standard hash function applies directly:

import hashlib

fingerprint = hashlib.sha256(simple_trace.encode()).hexdigest()
# deterministic across runs, machines, and Python versions

Pairwise isomorphism — compute once per graph, then compare strings:

G = st.parse.from_dimacs(file_path='./data/isotest/cfi-rigid-t2/cfi-rigid-t2-0020-01-1')[0]
H = st.parse.from_dimacs(file_path='./data/isotest/cfi-rigid-t2/cfi-rigid-t2-0020-01-2')[0]
E = st.parse.from_dimacs(file_path='./data/isotest/cfi-rigid-t2/cfi-rigid-t2-0020-02-1')[0]
F = st.parse.from_dimacs(file_path='./data/isotest/cfi-rigid-t2/cfi-rigid-t2-0020-02-2')[0]

Gc, Hc, Ec, Fc = [st.canonize.to_cgraph(x) for x in (G, H, E, F)]

assert str(Gc) == str(Hc)   # G and H are isomorphic
assert str(Ec) == str(Fc)   # E and F are isomorphic
assert str(Gc) != str(Ec)   # but the two classes are distinct

Indexing a graph database

The canonical trace is a stable, serializable key. Store it in a dictionary, a SQL column, or any key-value store — querying then costs a single lookup instead of running a new isomorphism test for each candidate.

import scott as st
import hashlib

def graph_key(g):
	"""Deterministic string key for any labelled graph."""
	return str(st.canonize.to_cgraph(g))

# --- Build an index offline (once per corpus) ---

index = {}  # trace -> list of graph IDs; swap for any DB

corpus = [
	("mol_001", st.parse.from_sdf(file_path='./data/molecule/cafeine.sdf')[0]),
	("mol_002", st.parse.from_sdf(file_path='./data/molecule/simple.sdf')[0]),
	# ... thousands more
]

for gid, graph in corpus:
	key = graph_key(graph)
	index.setdefault(key, []).append(gid)

# All entries with more than one ID are duplicate structures
duplicates = {k: v for k, v in index.items() if len(v) > 1}

# --- Query at runtime (no isomorphism test needed) ---

query = st.parse.parse_smiles('CN1C=NC2=C1C(=O)N(C(=O)N2C)C')  # caffeine via SMILES
matches = index.get(graph_key(query), [])
print("Identical structures in corpus:", matches)

For a compact fixed-size key (SQL index, Bloom filter, embedding store), hash the trace:

def graph_hash(g, algorithm='sha256'):
	return hashlib.new(algorithm, graph_key(g).encode()).hexdigest()

# Same hash  <=>  same isomorphism class (collision probability negligible)
assert graph_hash(G) == graph_hash(H)
assert graph_hash(G) != graph_hash(E)

Canonical Adjacency Matrices

On a graph G of N vertices, an adjacency matrix is an N×N array describing pairwise connectivity. Without a canonical vertex ordering, two isomorphic graphs yield different matrices. Scott's canonical ordering fixes this — particularly useful for graph neural networks and other learning pipelines where the same structure should map to the same input tensor.

import scott as st

g = st.parse.from_dot(file_path="./data/isotest/cfi-rigid-t2-dot/cfi-rigid-t2-0020-02-2.dot")[0]
h = st.parse.from_dot(file_path="./data/isotest/cfi-rigid-t2-dot/cfi-rigid-t2-0020-02-1.dot")[0]

# raw matrices differ because vertex order is arbitrary
assert g.adjacency_matrix() != h.adjacency_matrix()

# canonical matrices are identical for isomorphic graphs
assert g.adjacency_matrix(canonic=True) == h.adjacency_matrix(canonic=True)

Testing

# run all tests
python -m pytest test/pytest/ -v

# unit tests only
python -m pytest test/pytest/ -v -m unit

# canonization benchmarks
python -m pytest test/pytest/ -v -m canonization

Cross-engine comparison via the unified test runner:

python3 test/cli/test_runner.py --interactive
python3 test/cli/test_runner.py validity --engine rs
python3 test/cli/test_runner.py cfi-rigid --engine rs --release -n 30

Results are written under results/. For more detail, see docs/Testing.md.


Citation

If you use or fork scott in further works, please cite:

@inproceedings{bloyet2019scott,
  title={Scott: A method for representing graphs as rooted trees for graph canonization},
  author={Bloyet, Nicolas and Marteau, Pierre-Fran{\c{c}}ois and Frenod, Emmanuel},
  booktitle={International Conference on Complex Networks and Their Applications},
  pages={578--590},
  year={2019},
  organization={Springer}
}

Licence

Available under the MIT licence. See the GitHub repo.

Written and developed by:

logos

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scott_trace-1.1.2.tar.gz (8.1 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

scott_trace-1.1.2-cp38-abi3-win_amd64.whl (297.0 kB view details)

Uploaded CPython 3.8+Windows x86-64

scott_trace-1.1.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (451.5 kB view details)

Uploaded CPython 3.8+manylinux: glibc 2.17+ x86-64

scott_trace-1.1.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (436.2 kB view details)

Uploaded CPython 3.8+manylinux: glibc 2.17+ ARM64

scott_trace-1.1.2-cp38-abi3-macosx_11_0_arm64.whl (400.3 kB view details)

Uploaded CPython 3.8+macOS 11.0+ ARM64

File details

Details for the file scott_trace-1.1.2.tar.gz.

File metadata

  • Download URL: scott_trace-1.1.2.tar.gz
  • Upload date:
  • Size: 8.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scott_trace-1.1.2.tar.gz
Algorithm Hash digest
SHA256 2fb8678faa918c32dd863ad822aa8ebc6282fa963255d884fdc9296d2fd84c08
MD5 61f2d7c887ce30ad2ffbeec23e7701a0
BLAKE2b-256 fc61e72d414fd0b3b23cfd7731e120a41fd871ea965dd919d7bb4f32cc6f6bc8

See more details on using hashes here.

Provenance

The following attestation bundles were made for scott_trace-1.1.2.tar.gz:

Publisher: release.yml on theplatypus/scott

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scott_trace-1.1.2-cp38-abi3-win_amd64.whl.

File metadata

  • Download URL: scott_trace-1.1.2-cp38-abi3-win_amd64.whl
  • Upload date:
  • Size: 297.0 kB
  • Tags: CPython 3.8+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scott_trace-1.1.2-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 8a02338e7361a7500bc12691cef93af5ee488ff50c14d66af402fd748fc857c3
MD5 acfcdc3bc704870574891106cc711b40
BLAKE2b-256 b15b08f5a7eb9bb73d3ef2b7d81934f82f5c0e5497f39deafb7387923d61780e

See more details on using hashes here.

Provenance

The following attestation bundles were made for scott_trace-1.1.2-cp38-abi3-win_amd64.whl:

Publisher: release.yml on theplatypus/scott

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scott_trace-1.1.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for scott_trace-1.1.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d2c472d13d607a226894b9330e695dfd9e4fd465508fd6421c042da9eef558a6
MD5 34f9c0bc0b5df917ebf6d5a4e274738f
BLAKE2b-256 2d82dc7afb65da309d45aa2e797f1ac7db59ede436369ac1baa01ad5f4cbadfc

See more details on using hashes here.

Provenance

The following attestation bundles were made for scott_trace-1.1.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on theplatypus/scott

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scott_trace-1.1.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for scott_trace-1.1.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 96b405dee66b64ac2f84cae57dc7a3b5c9be627eb40e70dcb83ecb1ad885fada
MD5 b1ffc8c9bf2cf42f8762ba1991092b82
BLAKE2b-256 a4a02dc46fe7fbc640fbb427f53392f272b39a6f82a61f2cf14b604240656553

See more details on using hashes here.

Provenance

The following attestation bundles were made for scott_trace-1.1.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on theplatypus/scott

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scott_trace-1.1.2-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for scott_trace-1.1.2-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 60373b58aaedaf030ad56f84e3d5aa18b166a6abe201c619b14f98570f2284ea
MD5 730b104e2ae2f6753329131617775418
BLAKE2b-256 844faf355712d48f36aaec702c8ab6e9a81df36eb57803b4e3b935176b4cb18a

See more details on using hashes here.

Provenance

The following attestation bundles were made for scott_trace-1.1.2-cp38-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on theplatypus/scott

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page