Topology-native explainable language model prototype powered by Topologist

TopoLM

A topology-native, explainable language model prototype powered by topologist.

TopoLM combines:

Topology-native graph memory using topologist and NetworkX.
Hyperdimensional encoding for unit, domain, and sentence representations.
Evidence-based candidate retrieval from phrase continuations, direct edges, and retrieved contexts.
Explainable scoring with breakdowns of evidence, domain match, POS grammar, and repetition penalties.
Generation with multiple decoding strategies (nucleus, beam, greedy) and phrase-tail detection.
Hugging Face dataset support for training on large text corpora.
Persistence with full state save/load, graph serialization, and memory reconstruction.

Why Topology for Language Models?

Most neural LMs are opaque black boxes. Most symbolic systems are brittle and limited.

TopoLM sits between:

Input text
  -> Tokenize & domain detect
  -> Build symbolic graph (units, phrases, domains, POS)
  -> HDC encoding for each node
  -> Topological memory state
  
  -> Inference (next-token prediction, generation)
  -> Explainable evidence trails
  -> Drift detection & refinement

Each token, phrase, and domain relationship is stored explicitly in the graph, encoded into a high-dimensional bipolar vector, and scored by evidence, topology, and confidence. This gives you a language model that is:

Interpretable: see exactly why a prediction was made.
Grounded: candidates are drawn from explicit graph structure, which limits nonsensical outputs.
Efficient: inference relies on graph queries and HDC similarity rather than matrix multiplications.
Debuggable: modify graph state, track provenance, refine confidence.

Architecture

Text input
    |
    v
Tokenizer (unit, POS, domain, entity recognition)
    |
    v
Graph builder
  - Unit nodes (with frequency, domain, POS)
  - Phrase nodes (with multi-gram spans)
  - Domain nodes
  - Relations (next_unit, appears_near, likely_next, domain_related, has_pos)
    |
    v
HDC Memory (Topologist + fallback NetworkX)
  - Encode units, phrases, domains, positions into {-1,+1}^D vectors
  - Store graph topology
  - Bundled snapshots for drift
    |
    v
Inference (Predict or Generate)
  - Context Index (HDC similarity retrieval)
  - Candidate retrieval (phrase continuation, direct edges, domain priors, unigrams)
  - Evidence scoring (weighted by source: phrase, direct, RAG, domain, frequency)
  - Grammar validation (POS sequences)
  - Sampling (nucleus, beam, greedy)
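
The "Bundled snapshots for drift" step is not detailed in this description. One common hyperdimensional-computing approach, shown here as a minimal sketch, is to bundle vectors by elementwise majority vote and to compare two snapshots by cosine similarity. The names `random_bipolar`, `bundle`, and `cosine` are hypothetical and are not part of the TopoLM or Topologist API.

```python
import random

def random_bipolar(dim, rng):
    """Draw a random vector from {-1, +1}^dim."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def bundle(vectors):
    """Elementwise majority vote; ties resolve to +1."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; for bipolar vectors this is dot(a, b) / dim."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
dim = 2048
units = [random_bipolar(dim, rng) for _ in range(6)]

snapshot_old = bundle(units[:3])   # memory state built from three units
snapshot_same = bundle(units[:3])  # the same memory, bundled again
snapshot_new = bundle(units[3:])   # memory built from three different units

# Drift is measured here as 1 - cosine similarity between snapshots.
drift_same = 1.0 - cosine(snapshot_old, snapshot_same)
drift_new = 1.0 - cosine(snapshot_old, snapshot_new)
print(f"drift (unchanged) = {drift_same:.3f}")
print(f"drift (replaced)  = {drift_new:.3f}")
```

A bundle stays measurably similar to each vector it contains, so a snapshot can be compared against later states to see how far the memory has moved.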

Install

pip install topolm

For Hugging Face dataset support:

pip install topolm[hf]

For development:

pip install -e ".[dev]"
pytest -q
python -m build
twine check dist/*

Quick Start

Basic Training and Prediction

from topolm import TopoLM, Config

corpus = """
The cat sat on the mat.
The dog sat on the floor.
CYP3A4 inhibition increases drug exposure.
Clarithromycin inhibits CYP3A4.
"""

model = TopoLM(Config()).fit(corpus)

# Get next-token predictions
preds = model.distribution("clarithromycin inhibits", top_k=5)
for p in preds:
    print(f"  {p.text:20s} prob={p.probability:.3f} score={p.score:.3f}")

# Generate fluent text
generated = model.generate("cyp3a4 inhibition", decoding="beam")
print(generated)

Training from Text List

texts = [
    "Sentence one.",
    "Another sentence.",
    "Third sentence here.",
]
model = TopoLM(Config()).fit_texts(texts)

Training from Hugging Face Dataset

from topolm import load_hf_dataset

texts = load_hf_dataset(
    "wikitext",
    split="train",
    text_field="text",
    sample_size=1000
)
model = TopoLM(Config()).fit_texts(texts)

Save and Load

import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = model.save(tmpdir)
    loaded = TopoLM.load(path)
    print(loaded.distribution("clarithromycin inhibits", top_k=3))

Model Explanation

explanation = model.explain("clarithromycin inhibits", "cyp3a4")
print(f"Score: {explanation['score']:.3f}")
print(f"Breakdown: {explanation['breakdown']}")
print(f"Evidence paths: {explanation['paths'][:3]}")

CLI

Train and interact with a demo model:

topolm demo

Make predictions:

topolm predict "clarithromycin inhibits"

Generate text:

topolm generate "cyp3a4 inhibition" --decoding beam

Main Features

1. Hyperdimensional Unit Memory

Tokens and phrases are encoded into stable bipolar vectors using seeded random generation:

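from topolm import Config
from topolm.core import HDC  # HDC lives in core.py (see Project Structure)

config = Config(dim=1024, seed=42)
hdc = HDC(dim=config.dim, seed=config.seed)
vector = hdc.get("unit:clarithromycin")  # {-1, +1}^1024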
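
A minimal sketch of what "stable" can mean here, assuming each vector is derived deterministically from the global seed and the node key. The hashing scheme and the function names `stable_bipolar` and `similarity` are assumptions for illustration, not the actual implementation in TopoLM or Topologist.

```python
import hashlib
import random

def stable_bipolar(key, dim=1024, seed=42):
    """Return the same {-1, +1}^dim vector for the same (seed, key) pair."""
    # Hash seed and key together so the result does not depend on
    # Python's per-process string hash randomization.
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return [1 if rng.getrandbits(1) else -1 for _ in range(dim)]

def similarity(a, b):
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

v1 = stable_bipolar("unit:clarithromycin")
v2 = stable_bipolar("unit:clarithromycin")
v3 = stable_bipolar("unit:cat")

print(v1 == v2)                      # same key and seed give the same vector
print(round(similarity(v1, v3), 3))  # unrelated keys are nearly orthogonal
```

Because random bipolar vectors in high dimensions are nearly orthogonal, unrelated keys stay distinguishable without any training step.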

2. Symbolic Graph Topology

Units, phrases, and domains are connected via typed relations:

next_unit: direct token transitions
appears_near: positional co-occurrence
likely_next: phrase continuation
domain_related: domain affinity
has_pos: part-of-speech tagging

g = model.graph
edges = list(g.out_edges("unit:clarithromycin", data=True))
for s, t, d in edges:
    print(f"{s} --{d['relation']}--> {t} (conf={d.get('confidence', 0.0):.2f})")
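
How relations like these could be derived from a token sequence, as a minimal sketch. It uses a plain `Counter` rather than NetworkX or Topologist, and the counting rules (adjacent tokens for `next_unit`, a forward window for `appears_near`) are assumptions based on the relation names and the `window` setting. `build_relations` is a hypothetical helper.

```python
from collections import Counter

def build_relations(tokens, window=8):
    """Count typed edges between unit nodes for one tokenized sentence."""
    edges = Counter()
    for i, tok in enumerate(tokens):
        src = f"unit:{tok}"
        # next_unit: the immediately following token.
        if i + 1 < len(tokens):
            edges[(src, "next_unit", f"unit:{tokens[i + 1]}")] += 1
        # appears_near: any later token inside the co-occurrence window.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            edges[(src, "appears_near", f"unit:{tokens[j]}")] += 1
    return edges

edges = build_relations(["clarithromycin", "inhibits", "cyp3a4"], window=2)
for (src, relation, dst), count in sorted(edges.items()):
    print(f"{src} --{relation}--> {dst} (count={count})")
```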

3. Evidence-Based Candidate Retrieval

Candidates are scored by multiple overlapping sources:

Phrase-based: exact n-gram continuations from the graph
Direct edges: observed next-token relations
Retrieved context: HDC similarity to past sentences
Domain priors: units from matching domain
Entity copy: repeat entities from input
Frequency: unigram statistics

candidates = model.retrieve_candidates(
    units=["clarithromycin", "inhibits"],
    domain="drug_interaction",
    context_text="clarithromycin inhibits"
)
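
The way overlapping sources might combine into one ranked candidate list can be sketched as follows. The source weights, the `(token, source, strength)` hit format, and the function name `merge_candidates` are all hypothetical; TopoLM's actual weighting is not documented in this description.

```python
from collections import defaultdict

# Hypothetical source weights, chosen only for illustration.
SOURCE_WEIGHTS = {
    "phrase": 1.0,
    "direct": 0.8,
    "retrieved": 0.5,
    "domain": 0.3,
    "frequency": 0.1,
}

def merge_candidates(hits, max_candidates=96):
    """Combine (token, source, strength) hits into ranked candidates.

    Returns a list of (token, total_score, per_source_breakdown).
    """
    totals = defaultdict(float)
    breakdown = defaultdict(dict)
    for token, source, strength in hits:
        contribution = SOURCE_WEIGHTS[source] * strength
        totals[token] += contribution
        breakdown[token][source] = breakdown[token].get(source, 0.0) + contribution
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return [(t, totals[t], breakdown[t]) for t in ranked[:max_candidates]]

hits = [
    ("cyp3a4", "phrase", 1.0),
    ("cyp3a4", "direct", 1.0),
    ("drug", "domain", 1.0),
    ("the", "frequency", 1.0),
]
for token, score, parts in merge_candidates(hits):
    print(f"{token:10s} score={score:.2f} sources={sorted(parts)}")
```

Keeping the per-source contributions alongside the total is what makes a later score breakdown possible.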

4. Explainable Scoring

Each prediction includes a breakdown:

pred = model.distribution("clarithromycin inhibits", top_k=1)[0]
print(f"Text: {pred.text}")
print(f"Score: {pred.score:.3f}")
print(f"Probability: {pred.probability:.3f}")
print(f"Breakdown: {pred.breakdown}")
#  {'evidence': 0.5, 'phrase': 0.35, 'direct': 0.0, 'freq': 0.0, 'pos': 0.45, 'domain': 1.0, ...}
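
Each prediction carries both a raw `score` and a `probability`. Since the `temperature` option is described as a softmax temperature, one plausible mapping between the two is a temperature-scaled softmax over candidate scores. The sketch below works under that assumption, and `softmax_with_temperature` is a hypothetical name.

```python
import math

def softmax_with_temperature(scores, temperature=0.75):
    """Turn raw candidate scores into probabilities that sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(scores, exps)}

scores = {"cyp3a4": 1.8, "drug": 0.3, "the": 0.1}
for temperature in (0.75, 2.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, {tok: round(p, 3) for tok, p in probs.items()})
```

Lower temperatures concentrate probability on the top-scoring candidate, while higher temperatures flatten the distribution.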

5. Multiple Decoding Strategies

Generate text using nucleus sampling, beam search, or greedy selection:

# Nucleus sampling (default)
text = model.generate("prompt", decoding="nucleus", top_p=0.88)

# Beam search
text = model.generate("prompt", decoding="beam", beam_width=4)

# Greedy
text = model.generate("prompt", decoding="greedy")
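
For reference, nucleus (top-p) sampling and greedy selection over a single next-token distribution can be sketched as below. This is the standard technique written from scratch, not TopoLM's internal code, and the helper names are hypothetical.

```python
import random

def nucleus_filter(probs, top_p=0.88):
    """Keep the smallest high-probability set whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: p / mass for token, p in kept}

def sample_nucleus(probs, top_p=0.88, rng=random):
    """Sample one token from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]

def pick_greedy(probs):
    """Greedy decoding: always take the most probable token."""
    return max(probs, key=probs.get)

probs = {"cyp3a4": 0.70, "drug": 0.20, "exposure": 0.07, "the": 0.03}
print(nucleus_filter(probs))  # the low-probability tail is dropped
print(sample_nucleus(probs, rng=random.Random(0)))
print(pick_greedy(probs))
```

Beam search differs from both in that it keeps several partial sequences alive at each step instead of committing to one token.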

6. Domain Detection and Grounding

Automatic domain detection prevents category confusion:

domains = {
    "domestic": ["cat", "dog", "mat", "floor"],
    "cybersecurity": ["attacker", "exploit", "vulnerability"],
    "drug_interaction": ["cyp3a4", "clarithromycin", "inhibits"],
    "lm_research": ["language", "model", "topological"],
}
domain = model.tok.domain(["clarithromycin", "inhibits"])  # "drug_interaction"
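
A minimal sketch of keyword-based domain detection, assuming the domain is chosen by counting how many input units appear in each domain's keyword list. The function `detect_domain` and the `"general"` fallback label are hypothetical; the tokenizer's real logic may differ.

```python
def detect_domain(units, domains, default="general"):
    """Return the domain whose keyword list overlaps the units the most."""
    unit_set = {u.lower() for u in units}
    best, best_hits = default, 0
    for name, keywords in domains.items():
        hits = len(unit_set & set(keywords))
        if hits > best_hits:  # on ties, the earlier domain wins
            best, best_hits = name, hits
    return best

domains = {
    "domestic": ["cat", "dog", "mat", "floor"],
    "cybersecurity": ["attacker", "exploit", "vulnerability"],
    "drug_interaction": ["cyp3a4", "clarithromycin", "inhibits"],
    "lm_research": ["language", "model", "topological"],
}
print(detect_domain(["Clarithromycin", "inhibits"], domains))
print(detect_domain(["hello", "world"], domains))
```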

7. Full State Persistence

Save and restore the complete model state, including graph and HDC memory:

path = model.save("./model_checkpoint")
restored = TopoLM.load(path)
# Full parity: same predictions, same graph, same counts

8. Graph Compaction

Remove low-frequency edges to reduce memory:

stats = model.mem.compact(min_edge_frequency=2)
print(f"Removed {stats['removed_edges']} edges")
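
The effect of compaction can be illustrated on a plain edge-count dictionary. This sketch assumes that edges observed fewer than `min_edge_frequency` times are dropped; `compact_edges` and the `kept_edges` key are hypothetical and only mirror the shape of the call above.

```python
def compact_edges(edge_counts, min_edge_frequency=2):
    """Drop edges observed fewer than min_edge_frequency times."""
    kept = {
        edge: count
        for edge, count in edge_counts.items()
        if count >= min_edge_frequency
    }
    stats = {
        "removed_edges": len(edge_counts) - len(kept),
        "kept_edges": len(kept),
    }
    return kept, stats

edge_counts = {
    ("unit:the", "next_unit", "unit:cat"): 3,
    ("unit:cat", "next_unit", "unit:sat"): 2,
    ("unit:sat", "next_unit", "unit:quietly"): 1,
}
kept, stats = compact_edges(edge_counts, min_edge_frequency=2)
print(f"Removed {stats['removed_edges']} edges, kept {stats['kept_edges']}")
```

Rare edges are often noise from single occurrences, but pruning them also removes the only evidence for rare continuations, so the threshold is a trade-off.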

Configuration

Tune behavior via Config:

from topolm import Config, TopoLM

config = Config(
    dim=1024,                      # HDC vector dimension
    seed=42,                       # Reproducibility
    window=8,                      # Co-occurrence window
    phrase_lengths=(2, 3, 4, 5),   # Phrase n-gram sizes
    max_candidates=96,             # Retrieval pool size
    inference_candidates=48,       # Top-k for scoring
    temperature=0.75,              # Softmax temperature
    default_top_p=0.88,            # Nucleus threshold
    default_beam_width=4,          # Beam search width
    fast_dev_mode=True,            # Disable slow features
)
model = TopoLM(config).fit(corpus)  # corpus: training text, as in Quick Start
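
To make `phrase_lengths` concrete, the sketch below enumerates each n-gram of the configured sizes together with the token that follows it, which is the kind of data a phrase-continuation lookup needs. The helper `phrase_spans` is hypothetical and only illustrates the idea.

```python
def phrase_spans(tokens, phrase_lengths=(2, 3, 4, 5)):
    """Yield (phrase_tokens, next_token) pairs for each configured n-gram size."""
    for n in phrase_lengths:
        # Stop early enough that every phrase has a following token.
        for start in range(len(tokens) - n):
            yield tuple(tokens[start:start + n]), tokens[start + n]

tokens = ["cyp3a4", "inhibition", "increases", "drug", "exposure"]
for phrase, next_token in phrase_spans(tokens, phrase_lengths=(2, 3)):
    print(" ".join(phrase), "->", next_token)
```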

Examples

basic_demo.py: Simple in-memory training and generation.
hf_dataset_demo.py: Load and train on Hugging Face datasets.

Project Structure

topolm/
  __init__.py          # Public API
  config.py            # Configuration dataclass
  core.py              # TopoLM, Memory, Tokenizer, HDC
  cli.py               # Command-line interface
  datasets.py          # Hugging Face dataset loaders
examples/
  basic_demo.py        # In-memory example
  hf_dataset_demo.py   # Hugging Face example
tests/
  test_smoke.py        # Smoke tests
.github/
  workflows/
    publish.yml        # PyPI publishing workflow
pyproject.toml         # Project metadata and dependencies

Development

# Install with dev extras
pip install -e ".[dev]"

# Lint
ruff check .

# Run tests
pytest -q

# Build package
python -m build

# Check distributions
twine check dist/*

Publishing

PyPI Setup

Create a PyPI account.
Generate an API token.
Store it as a GitHub secret named PYPI_API_TOKEN.

Publish via CI

Tag and push a release:

git tag v0.9.2
git push origin v0.9.2

The GitHub Actions workflow .github/workflows/publish.yml will automatically build and publish to PyPI.

Manual Publishing

python -m build
twine upload dist/*

Limitations and Future Work

No fine-tuning: TopoLM learns from corpus statistics and has no gradient-based training.
Limited scalability: Designed for interpretability at the cost of training speed.
Topologist dependency: Uses topologist>=0.4.0 for graph reasoning and falls back to NetworkX when it is unavailable.
English-focused tokenization: Custom regex tokenizer; non-English text may need adaptation.

Future improvements:

Domain-specific confidence tuning.
Multi-hop inference over learned relations.
Tensor-backed HDC for GPU acceleration.
Streaming/online updates.

License

MIT

Contributing

Contributions are welcome. See CONTRIBUTING.md if the repository provides one, or open an issue.

Citation

If you use TopoLM in research, please cite:

@software{topolm2024,
  title={TopoLM: A Topology-Native Explainable Language Model},
  author={McMenemy, Robert},
  url={https://github.com/Arkay92/TopoLM},
  year={2024},
  version={0.0.7},
}

Acknowledgments

topologist for the hyperdimensional graph engine.
networkx for core graph algorithms.
huggingface/datasets for dataset loading.

