TopoLM

A topology-native, explainable language model prototype powered by topologist.

TopoLM combines:

  • Topology-native graph memory using topologist and NetworkX.
  • Hyperdimensional encoding for unit, domain, and sentence representations.
  • Evidence-based candidate retrieval from phrase continuations, direct edges, and retrieved contexts.
  • Explainable scoring with breakdowns of evidence, domain match, POS grammar, and repetition penalties.
  • Generation with multiple decoding strategies (nucleus, beam, greedy) and phrase-tail detection.
  • Hugging Face dataset support for training on large text corpora.
  • Persistence with full state save/load, graph serialization, and memory reconstruction.

Why Topology for Language Models?

Most neural LMs are opaque black boxes. Most symbolic systems are brittle and limited.

TopoLM sits between the two:

Input text
  -> Tokenize & domain detect
  -> Build symbolic graph (units, phrases, domains, POS)
  -> HDC encoding for each node
  -> Topological memory state
  
  -> Inference (next-token prediction, generation)
  -> Explainable evidence trails
  -> Drift detection & refinement

Each token, phrase, and domain relationship is stored explicitly in the graph, encoded into a high-dimensional bipolar vector, and scored by evidence, topology, and confidence. This gives you a language model that is:

  • Interpretable: see exactly why a prediction was made.
  • Grounded: candidates come from relations stored in the graph, which limits nonsense outputs.
  • Efficient: inference uses graph queries and HDC similarity instead of matrix multiplications.
  • Debuggable: modify graph state, track provenance, refine confidence.

Architecture

Text input
    |
    v
Tokenizer (unit, POS, domain, entity recognition)
    |
    v
Graph builder
  - Unit nodes (with frequency, domain, POS)
  - Phrase nodes (with multi-gram spans)
  - Domain nodes
  - Relations (next_unit, appears_near, likely_next, domain_related, has_pos)
    |
    v
HDC Memory (Topologist + fallback NetworkX)
  - Encode units, phrases, domains, positions into {-1,+1}^D vectors
  - Store graph topology
  - Bundled snapshots for drift
    |
    v
Inference (Predict or Generate)
  - Context Index (HDC similarity retrieval)
  - Candidate retrieval (phrase continuation, direct edges, domain priors, unigrams)
  - Evidence scoring (weighted by source: phrase, direct, RAG, domain, frequency)
  - Grammar validation (POS sequences)
  - Sampling (nucleus, beam, greedy)
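
The "Context Index (HDC similarity retrieval)" step can be pictured with a small standalone sketch. It is not TopoLM's implementation: the helpers unit_vector, bundle, encode, and retrieve are hypothetical names, and majority-vote bundling is simply the common HDC way to combine unit vectors into one sentence vector.

```python
import random

def unit_vector(name, dim=1024, seed=42):
    """Deterministic {-1,+1}^dim vector for a unit (hypothetical helper)."""
    rng = random.Random(f"{seed}:{name}")
    return [rng.choice((-1, 1)) for _ in range(dim)]

def bundle(vectors):
    """Element-wise majority vote; ties resolve to +1."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]

def similarity(a, b):
    """Cosine similarity, which for bipolar vectors is dot product / dim."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def encode(sentence):
    """Bundle the unit vectors of a whitespace-tokenized sentence."""
    return bundle([unit_vector(tok) for tok in sentence.lower().split()])

# Past sentences stored in the context index.
memory = [
    "the cat sat on the mat",
    "the dog sat on the floor",
    "clarithromycin inhibits cyp3a4",
]
index = {sentence: encode(sentence) for sentence in memory}

def retrieve(query, k=1):
    """Return the k stored sentences most similar to the query."""
    q = encode(query)
    ranked = sorted(index, key=lambda s: similarity(q, index[s]), reverse=True)
    return ranked[:k]

best = retrieve("clarithromycin inhibits")[0]
print(best)
```

Because unrelated random vectors are nearly orthogonal at this dimension, the stored sentence that shares units with the query stands out clearly from the rest.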

Install

pip install topolm

For Hugging Face dataset support:

pip install topolm[hf]

For development:

pip install -e ".[dev]"
pytest -q
python -m build
twine check dist/*

Quick Start

Basic Training and Prediction

from topolm import TopoLM, Config

corpus = """
The cat sat on the mat.
The dog sat on the floor.
CYP3A4 inhibition increases drug exposure.
Clarithromycin inhibits CYP3A4.
"""

model = TopoLM(Config()).fit(corpus)

# Get next-token predictions
preds = model.distribution("clarithromycin inhibits", top_k=5)
for p in preds:
    print(f"  {p.text:20s} prob={p.probability:.3f} score={p.score:.3f}")

# Generate fluent text
generated = model.generate("cyp3a4 inhibition", decoding="beam")
print(generated)

Training from Text List

texts = [
    "Sentence one.",
    "Another sentence.",
    "Third sentence here.",
]
model = TopoLM(Config()).fit_texts(texts)

Training from Hugging Face Dataset

from topolm import load_hf_dataset

texts = load_hf_dataset(
    "wikitext",
    split="train",
    text_field="text",
    sample_size=1000
)
model = TopoLM(Config()).fit_texts(texts)

Save and Load

import tempfile

# Save into a throwaway directory, reload, and confirm the restored model still predicts
with tempfile.TemporaryDirectory() as tmpdir:
    path = model.save(tmpdir)
    loaded = TopoLM.load(path)
    print(loaded.distribution("clarithromycin inhibits", top_k=3))

Model Explanation

explanation = model.explain("clarithromycin inhibits", "cyp3a4")
print(f"Score: {explanation['score']:.3f}")
print(f"Breakdown: {explanation['breakdown']}")
print(f"Evidence paths: {explanation['paths'][:3]}")

CLI

Train and interact with a demo model:

topolm demo

Make predictions:

topolm predict "clarithromycin inhibits"

Generate text:

topolm generate "cyp3a4 inhibition" --decoding beam

Main Features

1. Hyperdimensional Unit Memory

Tokens and phrases are encoded into stable bipolar vectors using seeded random generation:

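from topolm import Config
from topolm.core import HDC  # assumes HDC is importable from core.py, as listed under Project Structure

config = Config(dim=1024, seed=42)
hdc = HDC(dim=1024, seed=42)
vector = hdc.get("unit:clarithromycin")  # {-1, +1}^1024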
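
To illustrate what "stable" and "seeded" mean here, the following is a minimal sketch and not the library's HDC class. SeededHDC is a hypothetical stand-in that assumes each vector is derived from the seed and the unit name alone.

```python
import random

class SeededHDC:
    """Hypothetical stand-in for an HDC item memory (not TopoLM's HDC)."""

    def __init__(self, dim=1024, seed=42):
        self.dim = dim
        self.seed = seed
        self._cache = {}

    def get(self, name):
        # The RNG is seeded from (seed, name), so the vector depends only
        # on those two values and not on the order of earlier lookups.
        if name not in self._cache:
            rng = random.Random(f"{self.seed}:{name}")
            self._cache[name] = tuple(rng.choice((-1, 1)) for _ in range(self.dim))
        return self._cache[name]

def similarity(a, b):
    """Cosine similarity for bipolar vectors (dot product / dim)."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

first = SeededHDC(dim=1024, seed=42)
second = SeededHDC(dim=1024, seed=42)

v1 = first.get("unit:clarithromycin")
v2 = second.get("unit:clarithromycin")
other = first.get("unit:cat")

print(v1 == v2)                          # same seed and name give the same vector
print(round(similarity(v1, other), 3))   # near 0 for unrelated units
```

Two separately constructed memories agree on every vector, which is the property that lets a saved model be reloaded without storing vectors that can be regenerated.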

2. Symbolic Graph Topology

Units, phrases, and domains are connected via typed relations:

  • next_unit: direct token transitions
  • appears_near: positional co-occurrence
  • likely_next: phrase continuation
  • domain_related: domain affinity
  • has_pos: part-of-speech tagging

g = model.graph
edges = list(g.out_edges("unit:clarithromycin", data=True))
for s, t, d in edges:
    print(f"{s} --{d['relation']}--> {t} (conf={d.get('confidence', 0.0):.2f})")
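
As a rough picture of how two of these relations could be derived from text, here is a sketch that uses a plain Counter instead of NetworkX. The function build_edges and its tokenization are assumptions for illustration; TopoLM's own graph builder may differ.

```python
from collections import Counter

def build_edges(sentences, window=8):
    """Count next_unit and appears_near edges (hypothetical helper)."""
    edges = Counter()
    for sentence in sentences:
        tokens = sentence.lower().rstrip(".").split()
        units = [f"unit:{tok}" for tok in tokens]
        for i, src in enumerate(units):
            # Direct transition to the immediately following unit.
            if i + 1 < len(units):
                edges[(src, "next_unit", units[i + 1])] += 1
            # Co-occurrence with later units inside the window,
            # skipping the direct successor counted above.
            for j in range(i + 2, min(i + window + 1, len(units))):
                edges[(src, "appears_near", units[j])] += 1
    return edges

edges = build_edges([
    "The cat sat on the mat.",
    "The dog sat on the floor.",
])

def out_edges(node):
    """Outgoing edges of a node as (relation, target, frequency)."""
    return sorted((rel, tgt, n) for (src, rel, tgt), n in edges.items() if src == node)

for rel, tgt, n in out_edges("unit:sat"):
    print(f"unit:sat --{rel}--> {tgt} (freq={n})")
```

Keeping the relation name in the edge key means the same pair of units can be linked by several typed relations, each with its own frequency.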

3. Evidence-Based Candidate Retrieval

Candidates are scored by multiple overlapping sources:

  • Phrase-based: exact n-gram continuations from the graph
  • Direct edges: observed next-token relations
  • Retrieved context: HDC similarity to past sentences
  • Domain priors: units from matching domain
  • Entity copy: repeat entities from input
  • Frequency: unigram statistics

candidates = model.retrieve_candidates(
    units=["clarithromycin", "inhibits"],
    domain="drug_interaction",
    context_text="clarithromycin inhibits"
)
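
The idea of overlapping sources can be sketched as a weighted merge. Everything below is illustrative: SOURCE_WEIGHTS and merge_candidates are hypothetical, and the weight values are made up, not TopoLM's defaults.

```python
from collections import defaultdict

# Hypothetical per-source weights; the real values are not documented here.
SOURCE_WEIGHTS = {
    "phrase": 1.0,
    "direct": 0.8,
    "retrieved": 0.6,
    "domain": 0.4,
    "frequency": 0.2,
}

def merge_candidates(source_hits, max_candidates=96):
    """Combine per-source hits into one ranked pool with provenance."""
    pool = defaultdict(lambda: {"evidence": 0.0, "sources": []})
    for source, hits in source_hits.items():
        weight = SOURCE_WEIGHTS[source]
        for unit, strength in hits.items():
            pool[unit]["evidence"] += weight * strength
            pool[unit]["sources"].append(source)
    ranked = sorted(pool.items(), key=lambda kv: kv[1]["evidence"], reverse=True)
    return ranked[:max_candidates]

hits = {
    "phrase": {"cyp3a4": 1.0},
    "direct": {"cyp3a4": 1.0, "drug": 0.5},
    "domain": {"cyp3a4": 1.0, "exposure": 1.0},
    "frequency": {"the": 1.0},
}

ranked = merge_candidates(hits)
for unit, info in ranked:
    print(f"{unit:10s} evidence={info['evidence']:.2f} sources={info['sources']}")
```

A candidate supported by several sources accumulates evidence from each, and the recorded source list is what makes the later explanation possible.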

4. Explainable Scoring

Each prediction includes a breakdown:

pred = model.distribution("clarithromycin inhibits", top_k=1)[0]
print(f"Text: {pred.text}")
print(f"Score: {pred.score:.3f}")
print(f"Probability: {pred.probability:.3f}")
print(f"Breakdown: {pred.breakdown}")
#  {'evidence': 0.5, 'phrase': 0.35, 'direct': 0.0, 'freq': 0.0, 'pos': 0.45, 'domain': 1.0, ...}
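
One way to read such a breakdown is as inputs to a weighted sum. The sketch below assumes a simple linear combination with a repetition penalty. The WEIGHTS values and the score function are hypothetical, chosen only to show how per-component contributions can be reported.

```python
# Hypothetical component weights (not TopoLM's actual values).
WEIGHTS = {
    "evidence": 0.4,
    "phrase": 0.2,
    "direct": 0.15,
    "freq": 0.05,
    "pos": 0.1,
    "domain": 0.1,
}

def score(breakdown, repetition_penalty=0.0):
    """Return the total score and each component's contribution."""
    contributions = {key: WEIGHTS[key] * breakdown.get(key, 0.0) for key in WEIGHTS}
    total = sum(contributions.values()) - repetition_penalty
    return total, contributions

# Component values taken from the example breakdown above.
breakdown = {"evidence": 0.5, "phrase": 0.35, "direct": 0.0,
             "freq": 0.0, "pos": 0.45, "domain": 1.0}

total, contributions = score(breakdown)
penalized, _ = score(breakdown, repetition_penalty=0.1)

print(f"total={total:.3f} penalized={penalized:.3f}")
for key, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {key:8s} {value:.3f}")
```

Sorting the contributions shows at a glance which component drove the prediction, which is the practical use of the breakdown.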

5. Multiple Decoding Strategies

Generate text using nucleus sampling, beam search, or greedy selection:

# Nucleus sampling (default)
text = model.generate("prompt", decoding="nucleus", top_p=0.88)

# Beam search
text = model.generate("prompt", decoding="beam", beam_width=4)

# Greedy
text = model.generate("prompt", decoding="greedy")
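
For readers unfamiliar with the difference between these strategies, here is a standalone sketch of nucleus filtering next to greedy selection. It works on a plain token-to-probability dict and does not call TopoLM. The names nucleus_filter, sample, and greedy are hypothetical.

```python
import random

def nucleus_filter(probs, top_p=0.88):
    """Keep the smallest set of top tokens whose mass reaches top_p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def sample(probs, top_p=0.88, rng=None):
    """Draw one token from the nucleus."""
    rng = rng or random.Random()
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]

def greedy(probs):
    """Always pick the single most probable token."""
    return max(probs, key=probs.get)

probs = {"cyp3a4": 0.6, "the": 0.25, "drug": 0.1, "mat": 0.05}

nucleus = nucleus_filter(probs, top_p=0.88)
print(sorted(nucleus))                      # the low-probability tail is dropped
print(greedy(probs))
print(sample(probs, rng=random.Random(0)))  # seeded for repeatability
```

Greedy selection is deterministic, nucleus sampling trades some determinism for variety while never choosing from the discarded tail, and beam search (not shown) keeps several partial sequences alive at once.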

6. Domain Detection and Grounding

Automatic domain detection helps prevent category confusion:

domains = {
    "domestic": ["cat", "dog", "mat", "floor"],
    "cybersecurity": ["attacker", "exploit", "vulnerability"],
    "drug_interaction": ["cyp3a4", "clarithromycin", "inhibits"],
    "lm_research": ["language", "model", "topological"],
}
domain = model.tok.domain(["clarithromycin", "inhibits"])  # "drug_interaction"
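
A keyword table like this suggests detection by overlap counting. The sketch below assumes exactly that. detect_domain and the "general" fallback label are hypothetical, and the tokenizer's real logic may be more involved.

```python
DOMAINS = {
    "domestic": {"cat", "dog", "mat", "floor"},
    "cybersecurity": {"attacker", "exploit", "vulnerability"},
    "drug_interaction": {"cyp3a4", "clarithromycin", "inhibits"},
    "lm_research": {"language", "model", "topological"},
}

def detect_domain(units, domains=DOMAINS, default="general"):
    """Pick the domain whose keyword set overlaps most with the units."""
    counts = {
        name: sum(1 for unit in units if unit in keywords)
        for name, keywords in domains.items()
    }
    best = max(counts, key=counts.get)
    # Fall back to a neutral label when no keyword matched at all.
    return best if counts[best] > 0 else default

print(detect_domain(["clarithromycin", "inhibits"]))
print(detect_domain(["the", "cat", "sat"]))
print(detect_domain(["hello", "world"]))
```

The fallback matters in practice: without it, input that matches no keyword would be assigned to whichever domain happens to come first.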

7. Full State Persistence

Save and restore the complete model state, including graph and HDC memory:

path = model.save("./model_checkpoint")
restored = TopoLM.load(path)
# Full parity: same predictions, same graph, same counts

8. Graph Compaction

Remove low-frequency edges to reduce memory:

stats = model.mem.compact(min_edge_frequency=2)
print(f"Removed {stats['removed_edges']} edges")
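
Conceptually, compaction is a frequency filter over edges. This sketch assumes edges with a count below min_edge_frequency are dropped and uses a plain dict. The compact function here is a hypothetical illustration, not the body of model.mem.compact.

```python
def compact(edges, min_edge_frequency=2):
    """Drop edges seen fewer than min_edge_frequency times (assumed rule)."""
    kept = {edge: count for edge, count in edges.items()
            if count >= min_edge_frequency}
    stats = {
        "removed_edges": len(edges) - len(kept),
        "kept_edges": len(kept),
    }
    return kept, stats

# Edge keys are (source, relation, target); values are observed frequencies.
edges = {
    ("unit:sat", "next_unit", "unit:on"): 2,
    ("unit:on", "next_unit", "unit:the"): 2,
    ("unit:the", "next_unit", "unit:cat"): 1,
    ("unit:the", "next_unit", "unit:dog"): 1,
}

kept, stats = compact(edges, min_edge_frequency=2)
print(f"Removed {stats['removed_edges']} edges, kept {stats['kept_edges']}")
```

The trade-off is that rare but correct transitions disappear along with noise, so the threshold is worth checking against a held-out prompt set before compacting a model you intend to keep.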

Configuration

Tune behavior via Config:

from topolm import Config, TopoLM

config = Config(
    dim=1024,                      # HDC vector dimension
    seed=42,                       # Reproducibility
    window=8,                      # Co-occurrence window
    phrase_lengths=(2, 3, 4, 5),   # Phrase n-gram sizes
    max_candidates=96,             # Retrieval pool size
    inference_candidates=48,       # Top-k for scoring
    temperature=0.75,              # Softmax temperature
    default_top_p=0.88,            # Nucleus threshold
    default_beam_width=4,          # Beam search width
    fast_dev_mode=True,            # Disable slow features
)
model = TopoLM(config).fit(corpus)  # corpus: your training text, e.g. the string from Quick Start
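
The temperature setting is easiest to understand by its effect on a softmax. The following sketch is generic and does not call TopoLM. It assumes scores are turned into probabilities with a temperature-scaled softmax, and the score values are invented.

```python
import math

def softmax(scores, temperature=0.75):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {token: s / temperature for token, s in scores.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(v - peak) for token, v in scaled.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

scores = {"cyp3a4": 0.415, "drug": 0.2, "the": 0.05}  # illustrative scores

sharp = softmax(scores, temperature=0.75)
flat = softmax(scores, temperature=2.0)

for token in scores:
    print(f"{token:8s} T=0.75: {sharp[token]:.3f}  T=2.0: {flat[token]:.3f}")
```

Values below 1 concentrate probability on the top-scoring candidate, while values above 1 spread it out, which interacts with default_top_p by changing how many candidates fall inside the nucleus.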

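Examples

Runnable scripts live in examples/: basic_demo.py (in-memory example) and hf_dataset_demo.py (Hugging Face example).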

Project Structure

topolm/
  __init__.py          # Public API
  config.py            # Configuration dataclass
  core.py              # TopoLM, Memory, Tokenizer, HDC
  cli.py               # Command-line interface
  datasets.py          # Hugging Face dataset loaders
examples/
  basic_demo.py        # In-memory example
  hf_dataset_demo.py   # Hugging Face example
tests/
  test_smoke.py        # Smoke tests
.github/
  workflows/
    publish.yml        # PyPI publishing workflow
pyproject.toml         # Project metadata and dependencies

Development

# Install with dev extras
pip install -e ".[dev]"

# Lint
ruff check .

# Run tests
pytest -q

# Build package
python -m build

# Check distributions
twine check dist/*

Publishing

PyPI Setup

  1. Create a PyPI account.
  2. Generate an API token.
  3. Store as a GitHub secret named PYPI_API_TOKEN.

Publish via CI

Tag and push a release, substituting the version you are publishing for the example tag:

git tag v0.9.2
git push origin v0.9.2

The GitHub Actions workflow .github/workflows/publish.yml will automatically build and publish to PyPI.

Manual Publishing

python -m build
twine upload dist/*

Limitations and Future Work

  • No fine-tuning: TopoLM learns from corpus statistics; no gradient-based learning.
  • Limited scalability: the design favors interpretability over training speed.
  • Topologist dependency: graph reasoning uses topologist>=0.4.0, with a fallback to NetworkX.
  • English-focused tokenization: Custom regex tokenizer; non-English text may need adaptation.

Future improvements:

  • Domain-specific confidence tuning.
  • Multi-hop inference over learned relations.
  • Tensor-backed HDC for GPU acceleration.
  • Streaming/online updates.

License

MIT


Contributing

Contributions are welcome! Please see CONTRIBUTING.md (if applicable) or open an issue.


Citation

If you use TopoLM in research, please cite:

@software{topolm2024,
  title={TopoLM: A Topology-Native Explainable Language Model},
  author={McMenemy, Robert},
  url={https://github.com/Arkay92/TopoLM},
  year={2024},
  version={0.0.4},
}
