engram-generator

Procedural synthetic dataset generator for training reasoning AI — 2,022 generators across 100+ scientific domains

www.engram.one | Samples | Skill Tree | 3D Graph | PyPI

Teaching AI models how to reason and think is tricky, very tricky. Plenty of datasets attempt it, most of them through NLP and text.

This is not what the Engram Generator does!

This is a real-science-focused synthetic generator that behaves the way a teacher or professor does:

It uses real Theorems, Formulas, Examples and problems
The ML model is then presented with a problem, just like you were at school
While training, the generator decomposes the solution into steps, and then presents the final answer

For evaluation and testing, a similar approach is taken:

The generator produces a new problem, passing the needed context to solve it
The ML model is expected to compose a solution and an answer

Whether you reward the final answer or the entire solution (a.k.a. the reasoning or logic chain) is entirely up to you. You can use a final-answer reward or loss, or a ReasoningChain, which evaluates the model's solution as well.

My mate Opus 4.6[1m] and I built this dataset generator to address 3 key issues:

I wanted a dataset which is infinite and thus cannot be memorised. Because the values in formulas and problems keep changing, no model can actually memorise the solutions.
I believe that learning the underlying logic of how to solve a problem is far more important than memorising a pattern or a statistical solution.
The ability to reason must be traceable and observable, and thus part of the dataset.

And what better way to do this than by using the largest part of human scientific knowledge? By doing so, English and natural language aren't the primary focus; they are simply the enabler. The symbolism at the heart of the sciences, and the relationships within it, are the focus. However, representing and using such a vast corpus is an enormous undertaking for a single developer doing it as a side project, which is where Opus 4.6 comes into play.

What we have created

This is a Python-based procedural dataset which contains:

2,022 generators. 100+ scientific domains. 10^81 unique problems.

It uses:

Mathematics
Physics
Chemistry
Biology
Computer science
Engineering
Quantum theory
Earth sciences
Economics
Logic

and more, all as step-by-step reasoning problems. The goal is not to build a benchmark. The goal is to teach machines how humans reason, discover, and invent.

Every problem is generated on the fly; there is no file to download.
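To make "generated on the fly" concrete, here is a minimal sketch of a procedural generator for one task. It is not the package's code: `addition_target` and `generate_addition` are hypothetical names, and the step layout is only inferred from the addition sample in the Samples section.

```python
import random

STEP = " <step> "  # separator as it appears in the README's sample


def addition_target(a: int, b: int) -> str:
    """Build a column-by-column addition trace: problem, steps, answer."""
    width = max(len(str(a)), len(str(b)))
    xs, ys = str(a).zfill(width), str(b).zfill(width)
    parts = [f"{a} + {b}"]
    carry = 0
    # Walk the columns from the ones place upwards, carrying as needed.
    for x, y in zip(reversed(xs), reversed(ys)):
        total = int(x) + int(y) + carry
        terms = [x, y] + (["1"] if carry else [])
        parts.append("+".join(terms) + f"={total}")
        carry = total // 10
    parts.append(str(a + b))
    return STEP.join(parts)


def generate_addition(rng: random.Random, digits: int) -> tuple[str, str]:
    """Draw fresh operands each call, so no two runs need share a problem."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"add two {digits} digit numbers", addition_target(a, b)
```

Only the algorithm is stored; the operands come from the random number generator, which is why the instance space grows with the digit count rather than with any file on disk.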

Installing

pip install engram-generator

The current version is 0.3.0. It still contains unverified samples, which you should not use.

Anti-Memorising Approach

Models trained on static datasets learn to pattern-match, not to reason. Train a model on 10,000 addition problems and it learns a lookup table, not addition. Change the digit count and it breaks. That's memorisation pretending to be intelligence.

Human reasoning didn't develop by memorising answers. It developed by solving problems across domains; by recognising that the same recursive structure appears in Fibonacci sequences, merge sort, and mathematical induction. That a conservation law works the same way in thermodynamics, circuit analysis, and chemical equilibria. That a proof by contradiction in logic uses the same mental move as a reducibility argument in computability theory.

So the approach here is exactly the same:

Force the model to learn the underlying structure, by repeating the problems using different variables and digits.
Once the model has learned the easier tiers (lower levels such as 0 to 3), the generator can increase the difficulty and complexity

What this effectively creates is a levelling-up system.

Levelling System

You don't get to attempt RSA encryption until you can do modular arithmetic. You don't get to critique a proof until you can write one. The skill tree enforces this:

Tier	Tasks	What You Unlock	Examples
0	20	Fundamentals	Addition, subtraction, sorting, boolean logic
1	36	Building blocks	Multiplication, Fibonacci, Caesar cipher
2	47	Algebra & graphs	Derivatives, quadratics, graph reachability
3	95	Real maths	Integrals, determinants, boolean algebra
4	313	Applied science	Physics, probability, dynamic programming
5	730	Expert territory	PDEs, cryptography, quantum mechanics
6	521	Graduate level	Topology, general relativity, information theory
7	176	Meta-reasoning	Proof strategy, error detection, generalisation
8	31	Creative	Conjecture, isomorphism detection
9	29	Research	Algorithm design, impossibility proofs
10	24	Self-architecture	Scaling laws, architecture search, loss design
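The unlocking idea can be sketched in a few lines. This is an illustration, not the package's `SkillTree`: the prerequisite map, the task names, and the 0.9 mastery threshold are all assumptions made for the example.

```python
# Hypothetical prerequisite map; the real tree covers 2,022 tasks.
PREREQS = {
    "addition": [],
    "multiplication": ["addition"],
    "modular_arithmetic": ["multiplication"],
    "rsa_encryption": ["modular_arithmetic"],
}

MASTERY = 0.9  # assumed accuracy needed before a prerequisite counts as learned


def unlocked_tasks(accuracy: dict[str, float]) -> set[str]:
    """A task is unlocked once every direct prerequisite meets the mastery bar."""
    return {
        task
        for task, reqs in PREREQS.items()
        if all(accuracy.get(req, 0.0) >= MASTERY for req in reqs)
    }
```

With no scores recorded, only `addition` is available; RSA stays locked until modular arithmetic clears the bar.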

Reasoning Balance

Without balancing, formula-substitution problems (1,188 of 2,022 generators, about 59%) would dominate training. The model would learn to plug numbers into equations and call it a day.

Instead, training is balanced across 26 reasoning strategies. Each gets equal exposure regardless of how many generators belong to it:

Pattern	Generators	What it teaches
Formula substitution	1,188	Plug values into known equations
Meta-reasoning	112	Proof strategy, error analysis, architecture design
Probabilistic reasoning	87	Bayes, distributions, expected values, stochastic processes
Differential equations	76	ODEs, PDEs, boundary value problems, numerical methods
Graph traversal	73	BFS, DFS, Dijkstra, flow networks, connectivity
Simulation trace	60	State machines, data structures, protocol execution
Symbolic manipulation	49	Differentiation, integration, algebraic simplification
Construction & verification	39	Group axioms, homomorphisms, topological invariants
Geometric computation	34	Areas, volumes, intersections, convex hulls
Conservation & balance	33	Thermodynamic laws, Kirchhoff, chemical equilibria
Linear algebra	31	Matrix decomposition, eigenvalues, null spaces
Counting & enumeration	28	Permutations, Catalan numbers, inclusion-exclusion
Statistical inference	26	Hypothesis testing, confidence intervals, regression
Logical deduction	26	Natural deduction, resolution, sequent calculus
Modular arithmetic	22	CRT, Euler's totient, discrete logarithms
Transform methods	20	Fourier, Laplace, Z-transform, wavelets
Dynamic programming	19	Optimal substructure, memoisation, alignment
Optimization	19	Gradient descent, KKT conditions, convex methods
Series & convergence	19	Ratio test, power series, uniform convergence
Encoding & decoding	15	RSA, Huffman, Reed-Solomon, stream ciphers
Approximation & numerical	14	Newton-Raphson, quadrature, interpolation
Recursive decomposition	11	Divide-and-conquer, Tower of Hanoi, merge sort
Comparison & ordering	9	Periodic trends, mineral identification, ranking
Dimensional analysis	8	Unit conversion, significant figures, calibration
Greedy selection	3	Interval scheduling, bin packing, set cover
Search & backtracking	1	A*, constraint satisfaction

Formula substitution has 1,188 generators. Search & backtracking has 1. But during training, both patterns get 3.8% of samples. No free rides.
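The arithmetic behind the equal split can be checked with a short sketch. The counts are copied from the table above; `generator_weights` is a hypothetical helper and not the package's `get_pattern_weights`, whose return format is not shown here.

```python
# Generators per reasoning pattern, copied from the table above.
PATTERN_COUNTS = {
    "Formula substitution": 1188, "Meta-reasoning": 112,
    "Probabilistic reasoning": 87, "Differential equations": 76,
    "Graph traversal": 73, "Simulation trace": 60,
    "Symbolic manipulation": 49, "Construction & verification": 39,
    "Geometric computation": 34, "Conservation & balance": 33,
    "Linear algebra": 31, "Counting & enumeration": 28,
    "Statistical inference": 26, "Logical deduction": 26,
    "Modular arithmetic": 22, "Transform methods": 20,
    "Dynamic programming": 19, "Optimization": 19,
    "Series & convergence": 19, "Encoding & decoding": 15,
    "Approximation & numerical": 14, "Recursive decomposition": 11,
    "Comparison & ordering": 9, "Dimensional analysis": 8,
    "Greedy selection": 3, "Search & backtracking": 1,
}


def generator_weights(counts: dict[str, int]) -> dict[str, float]:
    """Sampling probability of ONE generator in each pattern.

    Every pattern gets the same total share (1 / number of patterns),
    divided evenly among that pattern's generators.
    """
    share = 1.0 / len(counts)
    return {pattern: share / n for pattern, n in counts.items()}


weights = generator_weights(PATTERN_COUNTS)
pattern_share_percent = round(100 / len(PATTERN_COUNTS), 1)  # 3.8
```

Under this scheme the single search & backtracking generator is sampled 1,188 times as often as any one formula-substitution generator, which is what equal exposure per pattern implies.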

Why Memorisation is Impossible

The entire curriculum is 1.85 MB of algorithms. It produces terabytes of unique instances. That's a compression ratio of 1,250,000:1, which means that if you use the full Engram Generator without tool or function calling, no current model can memorise the dataset. This brings us to the second issue: pattern matching in modern ML.

Models can still learn pattern matching by picking up heuristics, and to a certain degree that is useful. What you can do is evaluate or test the model on out-of-sample (OOS) difficulty tiers. For example, a model trained on simple arithmetic, if it has learned high-quality patterns and heuristics, should be able to transfer what it learned to unseen, more difficult problems. Furthermore, certain problems require other types of knowledge to be acquired first, for example:

Addition (T0) -> Multiplication (T1) -> Matrix Multiply (T4) -> Eigenvalues (T4) -> PCA, Quantum Systems, Stability Analysis (T5-T7)

Meaning, a model that can't do addition can't learn multiplication. Without multiplication, it can't do matrix multiplication. Without matrix multiplication, it can't compute eigenvalues. And without eigenvalues, it can't do PCA, solve quantum two-level systems, analyse ODE stability, compute stress tensors, or do spectral decomposition.
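The dependency chain above is a directed acyclic graph, so a valid learning order is a topological sort. The sketch below uses Python's standard `graphlib` module (3.9+); the task identifiers are hypothetical labels for the steps named in the chain, not the package's registry keys.

```python
from graphlib import TopologicalSorter

# Each task maps to the tasks it directly depends on (from the chain above).
DEPENDS_ON = {
    "addition": set(),
    "multiplication": {"addition"},
    "matrix_multiply": {"multiplication"},
    "eigenvalues": {"matrix_multiply"},
    "pca": {"eigenvalues"},
    "quantum_two_level": {"eigenvalues"},
    "ode_stability": {"eigenvalues"},
}


def curriculum_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return the tasks in an order where prerequisites always come first."""
    return list(TopologicalSorter(depends_on).static_order())


def all_prerequisites(task: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Collect direct and indirect prerequisites with an iterative walk."""
    seen: set[str] = set()
    stack = list(depends_on[task])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on[current])
    return seen
```

For example, `all_prerequisites("pca", DEPENDS_ON)` walks back through eigenvalues, matrix multiplication and multiplication to addition.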

Tokenizer

All mathematical notation is written in LaTeX. The model learns to read and write LaTeX as a native language: fractions, integrals, matrices, Greek letters (spelled out), superscripts, subscripts, and nested expressions. This means a model trained on Engram Generator doesn't just learn to solve maths -- it learns the standard notation that humans use to communicate it.

\frac{d}{dx}(-x^2-2x-2) <step> -1*2x=-2x <step> -2*1=-2 <step> 0 <step> -2x-2

\begin{pmatrix} -5 & 3 \\ 2 & 2 \end{pmatrix} \times \begin{pmatrix} -1 & -2 \\ -3 & 8 \end{pmatrix}

\oint_{|z|=3} \frac{1}{z^{2}+z-6} dz <step> poles: z=-3, 2 <step> Res(f,2)=0.2 <step> 1.2566i

Engram Generator uses a character-level tokenizer: every character maps to exactly one token. No subword merging. No BPE. No SentencePiece.

Why? Subword tokenizers destroy the structure that reasoning depends on:

Digit atomicity: BPE can merge "123" into a single token. The model then can't see that the 3 is in the ones place and the 1 is in the hundreds place, which makes arithmetic far harder to learn. Character-level tokenization keeps every digit separate, so carry operations and place-value reasoning work naturally.
LaTeX preservation: LaTeX uses nested braces, superscripts, and subscripts (\frac{d}{dx}, x^{2}). Subword tokenizers split these unpredictably -- \frac might become \fr + ac, breaking the command boundary. Character-level tokenization preserves brace matching, command names, and operator structure exactly as written.
Deterministic alignment: Every character is exactly one token. No ambiguity about tokenization boundaries. The model's attention patterns can align precisely with the mathematical structure of the problem.

The character set (132 characters + 3 special tokens = 135 vocab):

Category	Characters
Digits (10)	`0 1 2 3 4 5 6 7 8 9`
Lowercase (26)	`a b c ... z`
Uppercase (26)	`A B C ... Z`
Greek (12)	`α β γ δ ε θ λ μ π σ φ ω`
Arithmetic (5)	`+ - * / ^`
Relations (4)	`≤ ≥ ≠ ≈`
Grouping (6)	`( ) [ ] { }`
Calculus & analysis (4)	`∂ ∫ √ ∞`
Set theory (5)	`∈ ⊂ ∅ ∩ ∪`
Logic (9)	`∀ ∃ ¬ ∧ ∨ ⊢ ⊨ ↔ ⊥`
Punctuation (9)	`= : ; ? . , ! ' "`
LaTeX & structure (7)	`\ _ | ~ < > %`
Other (9)	`# @ $ & ° × — → (space)`
Special tokens (3)	`<pad> <eos> <step>`

The <step> token separates solution steps in the target sequence. All generator output is constrained to use only characters in this set -- any generator that produces a character outside it is a bug and is caught by the test suite.
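A character-level tokenizer over this set fits in a few lines. The sketch below is an illustration, not the package's tokenizer: the token-id order and the function names are assumptions, and the only subtlety is matching the three special tokens before falling back to single characters, since `<` and `>` are also ordinary characters in the set.

```python
import string

SPECIALS = ["<pad>", "<eos>", "<step>"]

# The 132 characters from the table above, group by group.
CHARS = (
    string.digits + string.ascii_lowercase + string.ascii_uppercase
    + "αβγδεθλμπσφω"      # Greek
    + "+-*/^"              # arithmetic
    + "≤≥≠≈"               # relations
    + "()[]{}"             # grouping
    + "∂∫√∞"               # calculus & analysis
    + "∈⊂∅∩∪"              # set theory
    + "∀∃¬∧∨⊢⊨↔⊥"          # logic
    + "=:;?.,!'\""         # punctuation
    + "\\_|~<>%"           # LaTeX & structure
    + "#@$&°×\u2014→ "     # other (\u2014 is the long dash, last one is space)
)

VOCAB = SPECIALS + list(CHARS)  # id order is an assumption
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def encode(text: str) -> list[int]:
    """Map text to ids: special tokens first, then one id per character."""
    ids, i = [], 0
    while i < len(text):
        for special in SPECIALS:
            if text.startswith(special, i):
                ids.append(TOKEN_TO_ID[special])
                i += len(special)
                break
        else:
            # Raises KeyError for any character outside the set.
            ids.append(TOKEN_TO_ID[text[i]])
            i += 1
    return ids


def decode(ids: list[int]) -> str:
    """Inverse of encode: concatenate the token strings."""
    return "".join(VOCAB[i] for i in ids)
```

Because every character is one id, `encode(r"\frac{d}{dx}")` has exactly as many tokens as the string has characters, and an out-of-set character fails loudly instead of being silently merged.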

Samples

Input:  add two 5 digit numbers
Target: 13278 + 46048 <step> 8+8=16 <step> 7+4+1=12 <step> 2+0+1=3 <step> 3+6=9 <step> 1+4=5 <step> 59326

Input: natural language task description
Target: problem, solution steps, and answer separated by <step> tokens
Both capped at 512 characters
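Because the target is a flat string with `<step>` separators, both reward styles mentioned earlier are easy to sketch. The functions below are hypothetical and are not the package's ReasoningChain; they assume the model's output uses the same problem, steps, answer layout as the target, and the position-wise step comparison is just one simple choice.

```python
STEP = "<step>"


def split_target(text: str) -> tuple[str, list[str], str]:
    """Split a sequence into (problem, solution steps, final answer)."""
    parts = [part.strip() for part in text.split(STEP)]
    return parts[0], parts[1:-1], parts[-1]


def final_answer_reward(prediction: str, target: str) -> float:
    """1.0 if the last segment matches the target's answer, else 0.0."""
    return 1.0 if split_target(prediction)[2] == split_target(target)[2] else 0.0


def chain_reward(prediction: str, target: str) -> float:
    """Fraction of target steps reproduced at the same position."""
    _, predicted_steps, _ = split_target(prediction)
    _, true_steps, _ = split_target(target)
    if not true_steps:
        return final_answer_reward(prediction, target)
    matched = sum(p == t for p, t in zip(predicted_steps, true_steps))
    return matched / len(true_steps)
```

A prediction that reaches 59326 with one wrong carry step would score 1.0 on the final answer but only 0.8 on the chain, which is the distinction between rewarding answers and rewarding reasoning.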

Usage

Generate samples

from engram_generator.curriculum.registry import get_generator

gen = get_generator("addition", min_difficulty=3, max_difficulty=5)
samples = gen.generate(100)

for sample in samples[:3]:
    print(f"Input:  {sample.input_text}")
    print(f"Target: {sample.target_text}")
    print(f"Answer: {sample.answer}")
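The same call can drive the out-of-sample check described earlier: train on one difficulty band, then measure accuracy on harder bands. The helper below is a sketch under assumptions: `accuracy_by_band` is a hypothetical name, the prompt is assumed to be the input text plus the problem segment of the target, and answers are compared as strings.

```python
from typing import Callable, Dict, Sequence, Tuple


def accuracy_by_band(
    make_samples: Callable[[int, int, int], Sequence],
    predict: Callable[[str], str],
    bands: Sequence[Tuple[int, int]],
    n: int = 100,
) -> Dict[Tuple[int, int], float]:
    """Exact-match accuracy of `predict` for each (min, max) difficulty band."""
    results = {}
    for lo, hi in bands:
        samples = make_samples(lo, hi, n)
        correct = 0
        for sample in samples:
            # Assumption: the problem statement is the first <step> segment.
            problem = sample.target_text.split("<step>")[0].strip()
            prompt = f"{sample.input_text} {problem}"
            correct += predict(prompt) == str(sample.answer)
        results[(lo, hi)] = correct / len(samples)
    return results


# With the package installed, make_samples could wrap the call shown above:
# make_samples = lambda lo, hi, n: get_generator(
#     "addition", min_difficulty=lo, max_difficulty=hi).generate(n)
```

A large gap between the training band and the harder bands suggests the model memorised surface patterns rather than the procedure.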

Use the skill tree

from engram_generator.curriculum.registry import get_all_generators
from engram_generator.curriculum.skill_tree import SkillTree

generators = get_all_generators()
tree = SkillTree(generators, retention_ratio=0.1)

# See what's unlocked
print(tree.get_unlocked_tasks())

# Level up by proving mastery
events = tree.update({"addition": 0.97, "subtraction": 0.85})

Balanced training

from engram_generator.curriculum.reasoning_patterns import (
    get_pattern_weights, get_pattern_summary,
)
from engram_generator.curriculum.registry import get_all_generators

gens = get_all_generators()
weights = get_pattern_weights(gens)

# Each of the 26 reasoning patterns gets equal training exposure
summary = get_pattern_summary(gens)
for pattern, count in sorted(summary.items(), key=lambda x: -x[1])[:5]:
    print(f"{pattern}: {count} generators -> 3.8% of training")

A Word of Warning:

I am actively developing this at the moment, and changes will happen.
I cannot manually review the entire universe of human knowledge and discovery, which is why most of the code is Claude-generated.
A lot of tokens have been burnt verifying the inputs, solutions and answers. Where possible, this is done by using third-party libraries to check the answers.
Where that is not possible but a Wikipedia example exists, a double-blind verification is done: Claude creates a generator with the same arguments, and the generator's answer must match what the Wikipedia example states. This is still vulnerable to bad parsing.
Where there are no examples and no libraries, only formulae or theorems, the generators are currently marked as unverified. Those will be verified using Fable and eventually audited manually by a human. This is to verify that the Engram Generator doesn't produce incorrect samples or garbage.

License

MIT

Organisation

www.engram.one · www.deepnet.one

Release history

0.3.0 (this version): Jul 24, 2026

0.1.0: Jun 10, 2026

Download files

Source Distribution

engram_generator-0.3.0.tar.gz (2.6 MB)

Uploaded Jul 24, 2026 Source

Built Distribution

engram_generator-0.3.0-py3-none-any.whl (2.8 MB)

Uploaded Jul 24, 2026 Python 3

File details

Details for the file engram_generator-0.3.0.tar.gz.

File metadata

Download URL: engram_generator-0.3.0.tar.gz
Upload date: Jul 24, 2026
Size: 2.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for engram_generator-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`3f3be60407cc63b60d5f8319b0db44f05dac51af4bcb8967bd6ec9152a03c27a`
MD5	`c509bb44e1c040ab5422c326e1359b0b`
BLAKE2b-256	`52a9eb5b6525d052475941cada5fdaaa0d8e064ed12b7adb2876e96a4f64baa9`

File details

Details for the file engram_generator-0.3.0-py3-none-any.whl.

File metadata

Download URL: engram_generator-0.3.0-py3-none-any.whl
Upload date: Jul 24, 2026
Size: 2.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for engram_generator-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1703675a936d625932eb6b780aadab9e0a0bd1716211df58c68c2e08f672c6b1`
MD5	`6a5781ec43bcb80e4135a52b97d1a0e3`
BLAKE2b-256	`bd0458f4e975653845dc74c64149929c2b444b56fc192ab159de256cb8dff7d5`
