TopoLM

A topology-native, explainable language model prototype powered by topologist.

TopoLM combines:

  • Topology-native graph memory using topologist and NetworkX.
  • Hyperdimensional encoding for unit, domain, and sentence representations.
  • Evidence-based candidate retrieval from phrase continuations, direct edges, and retrieved contexts.
  • Explainable scoring with breakdowns of evidence, domain match, POS grammar, and repetition penalties.
  • Generation with multiple decoding strategies (nucleus, beam, greedy) and phrase-tail detection.
  • Hugging Face dataset support for training on large text corpora.
  • Persistence with full state save/load, graph serialization, and memory reconstruction.

Why Topology for Language Models?

Most neural LMs are opaque black boxes. Most symbolic systems are brittle and limited.

TopoLM sits between the two:

Input text
  -> Tokenize & domain detect
  -> Build symbolic graph (units, phrases, domains, POS)
  -> HDC encoding for each node
  -> Topological memory state
  
  -> Inference (next-token prediction, generation)
  -> Explainable evidence trails
  -> Drift detection & refinement

Each token, phrase, and domain relationship is stored explicitly in the graph, encoded into a high-dimensional bipolar vector, and scored by evidence, topology, and confidence. This gives you a language model that is:

  • Interpretable: see exactly why a prediction was made.
  • Grounded: candidates come from relations stored in the graph, which limits nonsense outputs.
  • Efficient: no dense matrix multiplications; inference relies on graph queries and HDC similarity.
  • Debuggable: modify graph state, track provenance, refine confidence.

Architecture

Text input
    |
    v
Tokenizer (unit, POS, domain, entity recognition)
    |
    v
Graph builder
  - Unit nodes (with frequency, domain, POS)
  - Phrase nodes (with multi-gram spans)
  - Domain nodes
  - Relations (next_unit, appears_near, likely_next, domain_related, has_pos)
    |
    v
HDC Memory (Topologist + fallback NetworkX)
  - Encode units, phrases, domains, positions into {-1,+1}^D vectors
  - Store graph topology
  - Bundled snapshots for drift
    |
    v
Inference (Predict or Generate)
  - Context Index (HDC similarity retrieval)
  - Candidate retrieval (phrase continuation, direct edges, domain priors, unigrams)
  - Evidence scoring (weighted by source: phrase, direct, RAG, domain, frequency)
  - Grammar validation (POS sequences)
  - Sampling (nucleus, beam, greedy)

Install

pip install topolm

For Hugging Face dataset support:

pip install "topolm[hf]"

For development:

pip install -e ".[dev]"
pytest -q
python -m build
twine check dist/*

Quick Start

Basic Training and Prediction

from topolm import TopoLM, Config

corpus = """
The cat sat on the mat.
The dog sat on the floor.
CYP3A4 inhibition increases drug exposure.
Clarithromycin inhibits CYP3A4.
"""

model = TopoLM(Config()).fit(corpus)

# Get next-token predictions
preds = model.distribution("clarithromycin inhibits", top_k=5)
for p in preds:
    print(f"  {p.text:20s} prob={p.probability:.3f} score={p.score:.3f}")

# Generate fluent text
generated = model.generate("cyp3a4 inhibition", decoding="beam")
print(generated)

Training from Text List

texts = [
    "Sentence one.",
    "Another sentence.",
    "Third sentence here.",
]
model = TopoLM(Config()).fit_texts(texts)

Training from Hugging Face Dataset

from topolm import load_hf_dataset

texts = load_hf_dataset(
    "wikitext",
    split="train",
    text_field="text",
    sample_size=1000
)
model = TopoLM(Config()).fit_texts(texts)

Save and Load

import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = model.save(tmpdir)
    loaded = TopoLM.load(path)
    # The restored model should reproduce the original predictions
    print(loaded.distribution("clarithromycin inhibits", top_k=3))

Model Explanation

explanation = model.explain("clarithromycin inhibits", "cyp3a4")
print(f"Score: {explanation['score']:.3f}")
print(f"Breakdown: {explanation['breakdown']}")
print(f"Evidence paths: {explanation['paths'][:3]}")

CLI

Train and interact with a demo model:

topolm demo

Make predictions:

topolm predict "clarithromycin inhibits"

Generate text:

topolm generate "cyp3a4 inhibition" --decoding beam

Main Features

1. Hyperdimensional Unit Memory

Tokens and phrases are encoded into stable bipolar vectors using seeded random generation:

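from topolm import Config
from topolm.core import HDC  # assumed import path; the project structure lists HDC in core.py

config = Config(dim=1024, seed=42)
hdc = HDC(dim=1024, seed=42)
vector = hdc.get("unit:clarithromycin")  # {-1, +1}^1024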
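
To make "stable bipolar vectors using seeded random generation" concrete, here is a minimal standard-library sketch. It is not TopoLM's implementation: the function names `bipolar_vector`, `similarity`, and `bundle` are hypothetical, and hashing the key together with the seed is just one assumed way to get a reproducible vector per key.

```python
import hashlib
import random


def bipolar_vector(key: str, dim: int = 1024, seed: int = 42) -> list[int]:
    """Deterministically map a key to a vector in {-1, +1}^dim (illustrative)."""
    # Hash the seed and key together so the same key always yields the same vector.
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return [1 if rng.random() < 0.5 else -1 for _ in range(dim)]


def similarity(a: list[int], b: list[int]) -> float:
    """Normalized dot product; for bipolar vectors this equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def bundle(vectors: list[list[int]]) -> list[int]:
    """Element-wise majority vote; ties resolve to +1."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]


a = bipolar_vector("unit:clarithromycin")
b = bipolar_vector("unit:inhibits")
c = bipolar_vector("unit:cyp3a4")
sentence = bundle([a, b, c])

print(similarity(a, a))         # 1.0
print(similarity(a, b))         # close to 0: unrelated keys are nearly orthogonal
print(similarity(sentence, a))  # clearly positive: the bundle stays similar to its parts
```

Bundling is what would let a sentence or snapshot vector remain comparable to the units it was built from, which is the property similarity-based retrieval depends on.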

2. Symbolic Graph Topology

Units, phrases, and domains are connected via typed relations:

  • next_unit: direct token transitions
  • appears_near: positional co-occurrence
  • likely_next: phrase continuation
  • domain_related: domain affinity
  • has_pos: part-of-speech tagging

g = model.graph
edges = list(g.out_edges("unit:clarithromycin", data=True))
for s, t, d in edges:
    print(f"{s} --{d['relation']}--> {t} (conf={d.get('confidence', 0.0):.2f})")
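
As a rough illustration of how typed relations like these could be accumulated from token sequences, the sketch below counts `next_unit` and `appears_near` edges in a plain dictionary. It uses neither topologist nor NetworkX; `build_edges` and `out_edges` are hypothetical helpers, and defining confidence as relative frequency is an assumption, not the library's documented formula.

```python
from collections import Counter


def build_edges(sentences, window=3):
    """Count typed relations between unit nodes.

    Returns a Counter keyed by (source, relation, target).
    """
    edges = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            src = f"unit:{tok}"
            # next_unit: the immediately following token.
            if i + 1 < len(tokens):
                edges[(src, "next_unit", f"unit:{tokens[i + 1]}")] += 1
            # appears_near: any later token within the co-occurrence window.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                edges[(src, "appears_near", f"unit:{tokens[j]}")] += 1
    return edges


def out_edges(edges, source, relation):
    """List (target, confidence) pairs for one source and relation.

    Confidence is the relative frequency among matching edges (an assumption).
    """
    matching = {t: c for (s, r, t), c in edges.items() if s == source and r == relation}
    total = sum(matching.values())
    return sorted(
        ((t, c / total) for t, c in matching.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "floor"],
]
edges = build_edges(sentences)
for target, conf in out_edges(edges, "unit:sat", "next_unit"):
    print(f"unit:sat --next_unit--> {target} (conf={conf:.2f})")
```

Because every edge carries its relation type and count, a prediction can be traced back to the observed transitions that support it.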

3. Evidence-Based Candidate Retrieval

Candidates are scored by multiple overlapping sources:

  • Phrase-based: exact n-gram continuations from the graph
  • Direct edges: observed next-token relations
  • Retrieved context: HDC similarity to past sentences
  • Domain priors: units from matching domain
  • Entity copy: repeat entities from input
  • Frequency: unigram statistics

candidates = model.retrieve_candidates(
    units=["clarithromycin", "inhibits"],
    domain="drug_interaction",
    context_text="clarithromycin inhibits"
)
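
The idea of overlapping evidence sources can be sketched as a weighted merge. This is an illustration only: `merge_candidates` is a hypothetical helper, and the weights in `SOURCE_WEIGHTS` are invented for the example, since the README does not state the real values.

```python
from collections import defaultdict

# Hypothetical per-source weights; the actual values are not given in this README.
SOURCE_WEIGHTS = {
    "phrase": 1.0,
    "direct": 0.8,
    "retrieved": 0.6,
    "domain": 0.3,
    "frequency": 0.1,
}


def merge_candidates(sources, max_candidates=96):
    """Merge per-source candidate strengths into one ranked pool.

    `sources` maps a source name to {candidate: strength in [0, 1]}.
    Each candidate records which sources supported it, so the ranking
    stays explainable.
    """
    pool = defaultdict(lambda: {"evidence": 0.0, "sources": {}})
    for name, candidates in sources.items():
        weight = SOURCE_WEIGHTS[name]
        for cand, strength in candidates.items():
            pool[cand]["evidence"] += weight * strength
            pool[cand]["sources"][name] = strength
    ranked = sorted(pool.items(), key=lambda kv: (-kv[1]["evidence"], kv[0]))
    return ranked[:max_candidates]


sources = {
    "phrase": {"cyp3a4": 1.0},
    "direct": {"cyp3a4": 1.0},
    "domain": {"cyp3a4": 0.5, "drug": 0.5},
    "frequency": {"the": 1.0, "cyp3a4": 0.2},
}
ranked = merge_candidates(sources)
for cand, info in ranked:
    print(cand, round(info["evidence"], 3), sorted(info["sources"]))
```

A candidate supported by several sources ("cyp3a4" here) outranks one that only appears through unigram frequency ("the").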

4. Explainable Scoring

Each prediction includes a breakdown:

pred = model.distribution("clarithromycin inhibits", top_k=1)[0]
print(f"Text: {pred.text}")
print(f"Score: {pred.score:.3f}")
print(f"Probability: {pred.probability:.3f}")
print(f"Breakdown: {pred.breakdown}")
#  {'evidence': 0.5, 'phrase': 0.35, 'direct': 0.0, 'freq': 0.0, 'pos': 0.45, 'domain': 1.0, ...}
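
One plausible way a breakdown like this could turn into a score and a probability is a weighted sum followed by a temperature-scaled softmax. The sketch below assumes exactly that; the `WEIGHTS` values and the `repetition` component name are hypothetical, and TopoLM's actual scoring formula may differ.

```python
import math

# Hypothetical component weights; a negative weight acts as a penalty.
WEIGHTS = {"evidence": 1.0, "pos": 0.5, "domain": 0.5, "repetition": -1.0}


def score(breakdown):
    """Weighted sum of the breakdown components."""
    return sum(WEIGHTS[name] * value for name, value in breakdown.items())


def softmax(scores, temperature=0.75):
    """Convert raw scores to probabilities; a lower temperature sharpens them."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp((s - top) / temperature) for tok, s in scores.items()}
    norm = sum(exps.values())
    return {tok: e / norm for tok, e in exps.items()}


breakdowns = {
    "cyp3a4": {"evidence": 0.5, "pos": 0.45, "domain": 1.0, "repetition": 0.0},
    "inhibits": {"evidence": 0.2, "pos": 0.1, "domain": 1.0, "repetition": 1.0},
}
scores = {tok: score(b) for tok, b in breakdowns.items()}
probs = softmax(scores)
for tok in sorted(probs, key=probs.get, reverse=True):
    print(f"{tok:10s} score={scores[tok]:.3f} prob={probs[tok]:.3f}")
```

Keeping the per-component values next to the final score is what makes the result explainable: each term in the sum can be reported back to the user.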

5. Multiple Decoding Strategies

Generate text using nucleus sampling, beam search, or greedy selection:

# Nucleus sampling (default)
text = model.generate("prompt", decoding="nucleus", top_p=0.88)

# Beam search
text = model.generate("prompt", decoding="beam", beam_width=4)

# Greedy
text = model.generate("prompt", decoding="greedy")
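
For readers unfamiliar with the strategies, here is a generic sketch of nucleus (top-p) filtering and greedy selection over a next-token distribution. It shows the standard techniques, not TopoLM's internals; `nucleus_filter`, `sample_nucleus`, and `greedy` are hypothetical names.

```python
import random


def nucleus_filter(probs, top_p=0.88):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample_nucleus(probs, top_p=0.88, rng=None):
    """Sample one token from the renormalized nucleus."""
    rng = rng or random.Random()
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]


def greedy(probs):
    """Always pick the single most probable token."""
    return max(probs, key=probs.get)


probs = {"cyp3a4": 0.6, "the": 0.25, "drug": 0.1, "mat": 0.05}
print(sorted(nucleus_filter(probs)))  # the low-probability tail ("mat") is dropped
print(greedy(probs))                  # cyp3a4
print(sample_nucleus(probs, rng=random.Random(0)))
```

Greedy decoding is deterministic, nucleus sampling trades determinism for variety, and beam search (not shown) keeps several partial sequences alive and returns the best-scoring one.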

6. Domain Detection and Grounding

Automatic domain detection helps prevent category confusion by matching input units against per-domain keyword lists:

domains = {
    "domestic": ["cat", "dog", "mat", "floor"],
    "cybersecurity": ["attacker", "exploit", "vulnerability"],
    "drug_interaction": ["cyp3a4", "clarithromycin", "inhibits"],
    "lm_research": ["language", "model", "topological"],
}
domain = model.tok.domain(["clarithromycin", "inhibits"])  # "drug_interaction"
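
A keyword-overlap detector is one simple way such a lookup could work. The sketch below is an assumption about the approach rather than the library's code; `detect_domain` and the `"general"` fallback label are hypothetical.

```python
def detect_domain(units, domains, default="general"):
    """Return the domain whose keyword list overlaps most with the units.

    Falls back to `default` (a hypothetical label) when nothing matches.
    """
    best, best_hits = default, 0
    unit_set = set(units)
    for name, keywords in domains.items():
        hits = len(unit_set & set(keywords))
        if hits > best_hits:
            best, best_hits = name, hits
    return best


domains = {
    "domestic": ["cat", "dog", "mat", "floor"],
    "cybersecurity": ["attacker", "exploit", "vulnerability"],
    "drug_interaction": ["cyp3a4", "clarithromycin", "inhibits"],
    "lm_research": ["language", "model", "topological"],
}
print(detect_domain(["clarithromycin", "inhibits"], domains))  # drug_interaction
print(detect_domain(["hello", "world"], domains))              # general
```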

7. Full State Persistence

Save and restore the complete model state, including graph and HDC memory:

path = model.save("./model_checkpoint")
restored = TopoLM.load(path)
# Full parity: same predictions, same graph, same counts

8. Graph Compaction

Remove low-frequency edges to reduce memory:

stats = model.mem.compact(min_edge_frequency=2)
print(f"Removed {stats['removed_edges']} edges")
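
Conceptually, compaction is a frequency filter over edges. The following sketch shows that idea on a plain dictionary of edge counts; the standalone `compact` function and the `kept_edges` key are hypothetical, and only `removed_edges` appears in the example above.

```python
def compact(edge_counts, min_edge_frequency=2):
    """Drop edges observed fewer than min_edge_frequency times.

    Returns the surviving edges and a small stats dict.
    """
    kept = {edge: n for edge, n in edge_counts.items() if n >= min_edge_frequency}
    stats = {
        "removed_edges": len(edge_counts) - len(kept),
        "kept_edges": len(kept),  # hypothetical extra field for illustration
    }
    return kept, stats


edge_counts = {
    ("unit:sat", "next_unit", "unit:on"): 2,
    ("unit:the", "next_unit", "unit:cat"): 1,
    ("unit:the", "next_unit", "unit:dog"): 1,
}
kept, stats = compact(edge_counts)
print(f"Removed {stats['removed_edges']} edges")  # Removed 2 edges
```

The trade-off is that rare continuations lose their direct evidence, so a higher threshold saves memory at the cost of recall on infrequent phrases.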

Configuration

Tune behavior via Config:

from topolm import Config

config = Config(
    dim=1024,                      # HDC vector dimension
    seed=42,                       # Reproducibility
    window=8,                      # Co-occurrence window
    phrase_lengths=(2, 3, 4, 5),   # Phrase n-gram sizes
    max_candidates=96,             # Retrieval pool size
    inference_candidates=48,       # Top-k for scoring
    temperature=0.75,              # Softmax temperature
    default_top_p=0.88,            # Nucleus threshold
    default_beam_width=4,          # Beam search width
    fast_dev_mode=True,            # Disable slow features
)
model = TopoLM(config).fit(text)  # text: your training corpus as a single string
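
To show what a setting such as `phrase_lengths=(2, 3, 4, 5)` plausibly controls, the sketch below extracts every n-gram of those sizes together with the token that follows it. `phrase_spans` is a hypothetical helper, and whether TopoLM stores phrases in exactly this form is an assumption.

```python
def phrase_spans(tokens, phrase_lengths=(2, 3, 4, 5)):
    """Yield (phrase, next_token) pairs for each n-gram that has a continuation."""
    for n in phrase_lengths:
        # Stop n tokens before the end so that tokens[i + n] always exists.
        for i in range(len(tokens) - n):
            yield tuple(tokens[i:i + n]), tokens[i + n]


tokens = ["clarithromycin", "inhibits", "cyp3a4"]
print(list(phrase_spans(tokens)))
# [(('clarithromycin', 'inhibits'), 'cyp3a4')]
```

Longer phrase lengths capture more specific continuations but produce more phrase nodes, so this setting trades memory for precision.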

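Examples

See examples/basic_demo.py (in-memory example) and examples/hf_dataset_demo.py (Hugging Face example).
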
Project Structure

topolm/
  __init__.py          # Public API
  config.py            # Configuration dataclass
  core.py              # TopoLM, Memory, Tokenizer, HDC
  cli.py               # Command-line interface
  datasets.py          # Hugging Face dataset loaders
examples/
  basic_demo.py        # In-memory example
  hf_dataset_demo.py   # Hugging Face example
tests/
  test_smoke.py        # Smoke tests
.github/
  workflows/
    publish.yml        # PyPI publishing workflow
pyproject.toml         # Project metadata and dependencies

Development

# Install with dev extras
pip install -e ".[dev]"

# Lint
ruff check .

# Run tests
pytest -q

# Build package
python -m build

# Check distributions
twine check dist/*

Limitations and Future Work

  • No fine-tuning: TopoLM learns from corpus statistics; no gradient-based learning.
  • Limited scalability: Designed for interpretability at the cost of training speed.
  • Topologist dependency: Graph reasoning uses topologist>=0.4.0, with a fallback to NetworkX.
  • English-focused tokenization: Custom regex tokenizer; non-English text may need adaptation.

Future improvements:

  • Domain-specific confidence tuning.
  • Multi-hop inference over learned relations.
  • Tensor-backed HDC for GPU acceleration.
  • Streaming/online updates.

License

MIT


Contributing

Contributions are welcome! Please see CONTRIBUTING.md (if applicable) or open an issue.


Citation

If you use TopoLM in research, please cite:

@software{topolm2024,
  title={TopoLM: A Topology-Native Explainable Language Model},
  author={McMenemy, Robert},
  url={https://github.com/Arkay92/TopoLM},
  year={2024},
  version={0.1.0},
}
