Procedural synthetic dataset generator for training reasoning AI — 2,022 generators across 100+ scientific domains
ENGRAM GENERATOR
2,022 generators. 100+ scientific domains. 10^81 unique problems.
A procedural dataset that encodes the breadth of human scientific knowledge -- mathematics, physics, chemistry, biology, computer science, engineering, quantum theory, earth sciences, economics, logic, and more -- as step-by-step reasoning problems. The goal is not to build a benchmark. The goal is to teach machines how humans reason, discover, and invent.
Every problem is generated on-the-fly. Every answer is correct by construction. There is no dataset file -- just code that writes an endless exam across every discipline humans have formalised.
This repository contains AI-generated code, reviewed and directed by a human author.
pip install engram-generator
The Problem
Models trained on static datasets learn to pattern-match, not to reason. Train a model on 10,000 addition problems and it learns a lookup table, not addition. Change the digit count and it breaks. That's memorisation pretending to be intelligence.
Human reasoning didn't develop by memorising answers. It developed by solving problems across domains -- by recognising that the same recursive structure appears in Fibonacci sequences, merge sort, and mathematical induction. That a conservation law works the same way in thermodynamics, circuit analysis, and chemical equilibria. That a proof by contradiction in logic uses the same mental move as a reducibility argument in computability theory.
Engram Generator encodes this cross-domain structure:
- 10^81 unique problems -- more than atoms in the observable universe
- Step-by-step solutions -- show your work or fail
- 26 reasoning strategies with balanced exposure -- no single trick works
- 100+ scientific domains -- the breadth of formalised human knowledge
- Adaptive difficulty -- the curriculum escalates as the model improves
- Must Level Up -- advanced tasks are locked behind mastery of prerequisites
- Correct by construction -- every answer is computed by the same algorithm that generated the problem
The Arc
A model trained on this curriculum climbs from counting to self-awareness:
Tier 0 "2 + 3 = 5"
|
Tier 2 "d/dx(3x^2 + 2x) = 6x + 2"
|
Tier 5 "curl F = (dFz/dy - dFy/dz, ...)"
|
Tier 7 "This proof has an error in step 3. Here is the correction."
|
Tier 8 "These two problems share an isomorphic structure."
|
Tier 9 "To solve this class of problems, I would design the following algorithm."
|
Tier 10 "My architecture struggles with length generalisation.
Here is a proposed modification."
From following procedures to creating them. From solving problems to understanding what makes problems solvable.
What's Inside
| Domain | Generators | Highlights |
|---|---|---|
| Mathematics | 730+ | Arithmetic through category theory, PDEs, algebraic geometry, measure theory, homological algebra |
| Physics | 200+ | Classical mechanics to quantum field theory, plasma physics, particle physics, general relativity |
| Computer Science | 230+ | Algorithms, cryptography, compilers, distributed systems, ML theory, formal verification |
| Chemistry | 80+ | General, organic, physical, spectroscopy, polymer science, electrochemistry |
| Biology & Health | 90+ | Genetics, biochemistry, epidemiology, neuroscience, systems biology, pharmacology |
| Engineering | 100+ | Signal processing, control theory, semiconductors, photonics, aerospace, structural |
| Quantum | 50+ | Formalism, information theory, field theory, error correction |
| Earth & Space | 30+ | Astronomy, geology, oceanography, geophysics, climate science |
| Social & Cognitive | 50+ | Economics, game theory, linguistics, causal inference, cognitive science |
| Logic & Foundations | 50+ | Formal logic, model theory, computability, proof theory, set theory |
| Other | 100+ | Music theory, financial maths, medical imaging, persistent homology, wavelet theory |
Levelling System
You don't get to attempt RSA encryption until you can do modular arithmetic. You don't get to critique a proof until you can write one. The skill tree enforces this:
| Tier | Tasks | What You Unlock | Examples |
|---|---|---|---|
| 0 | 20 | Fundamentals | Addition, subtraction, sorting, boolean logic |
| 1 | 36 | Building blocks | Multiplication, Fibonacci, Caesar cipher |
| 2 | 47 | Algebra & graphs | Derivatives, quadratics, graph reachability |
| 3 | 95 | Real maths | Integrals, determinants, boolean algebra |
| 4 | 313 | Applied science | Physics, probability, dynamic programming |
| 5 | 730 | Expert territory | PDEs, cryptography, quantum mechanics |
| 6 | 521 | Graduate level | Topology, general relativity, information theory |
| 7 | 176 | Meta-reasoning | Proof strategy, error detection, generalisation |
| 8 | 31 | Creative | Conjecture, isomorphism detection |
| 9 | 29 | Research | Algorithm design, impossibility proofs |
| 10 | 24 | Self-architecture | Scaling laws, architecture search, loss design |
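The gating rule can be sketched in a few lines. This is a minimal illustration, not the package's `SkillTree` implementation: the prerequisite graph, the task names, and the 0.95 mastery threshold are all assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical prerequisite graph; task names and edges are illustrative only.
PREREQUISITES: Dict[str, List[str]] = {
    "addition": [],
    "multiplication": ["addition"],
    "modular_arithmetic": ["multiplication"],
    "rsa_encryption": ["modular_arithmetic"],
}

# Assumed mastery threshold; the value used by the package may differ.
MASTERY_THRESHOLD = 0.95


def unlocked_tasks(
    accuracy: Dict[str, float],
    prerequisites: Dict[str, List[str]] = PREREQUISITES,
    threshold: float = MASTERY_THRESHOLD,
) -> Set[str]:
    """Return every task whose direct prerequisites have all been mastered."""
    mastered = {task for task, acc in accuracy.items() if acc >= threshold}
    return {
        task
        for task, reqs in prerequisites.items()
        if all(req in mastered for req in reqs)
    }


print(sorted(unlocked_tasks({"addition": 0.97})))
# ['addition', 'multiplication']
```

Only direct prerequisites are checked here. Because each prerequisite must itself have been unlocked before it can be mastered, the chain from addition to RSA is still enforced step by step.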
Reasoning Balance
Without balancing, formula-substitution problems (1,188 of 2,022 generators, about 59%) would dominate training. The model would learn to plug numbers into equations and call it a day.
Instead, training is balanced across 26 reasoning strategies. Each gets equal exposure regardless of how many generators belong to it:
| Pattern | Generators | What it teaches |
|---|---|---|
| Formula substitution | 1,188 | Plug values into known equations |
| Meta-reasoning | 112 | Proof strategy, error analysis, architecture design |
| Probabilistic reasoning | 87 | Bayes, distributions, expected values, stochastic processes |
| Differential equations | 76 | ODEs, PDEs, boundary value problems, numerical methods |
| Graph traversal | 73 | BFS, DFS, Dijkstra, flow networks, connectivity |
| Simulation trace | 60 | State machines, data structures, protocol execution |
| Symbolic manipulation | 49 | Differentiation, integration, algebraic simplification |
| Construction & verification | 39 | Group axioms, homomorphisms, topological invariants |
| Geometric computation | 34 | Areas, volumes, intersections, convex hulls |
| Conservation & balance | 33 | Thermodynamic laws, Kirchhoff, chemical equilibria |
| Linear algebra | 31 | Matrix decomposition, eigenvalues, null spaces |
| Counting & enumeration | 28 | Permutations, Catalan numbers, inclusion-exclusion |
| Statistical inference | 26 | Hypothesis testing, confidence intervals, regression |
| Logical deduction | 26 | Natural deduction, resolution, sequent calculus |
| Modular arithmetic | 22 | CRT, Euler's totient, discrete logarithms |
| Transform methods | 20 | Fourier, Laplace, Z-transform, wavelets |
| Dynamic programming | 19 | Optimal substructure, memoisation, alignment |
| Optimization | 19 | Gradient descent, KKT conditions, convex methods |
| Series & convergence | 19 | Ratio test, power series, uniform convergence |
| Encoding & decoding | 15 | RSA, Huffman, Reed-Solomon, stream ciphers |
| Approximation & numerical | 14 | Newton-Raphson, quadrature, interpolation |
| Recursive decomposition | 11 | Divide-and-conquer, Tower of Hanoi, merge sort |
| Comparison & ordering | 9 | Periodic trends, mineral identification, ranking |
| Dimensional analysis | 8 | Unit conversion, significant figures, calibration |
| Greedy selection | 3 | Interval scheduling, bin packing, set cover |
| Search & backtracking | 1 | A*, constraint satisfaction |
Formula substitution has 1,188 generators. Search & backtracking has 1. But during training, both patterns get 3.8% of samples. No free rides.
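The arithmetic behind equal exposure is simple enough to sketch. The function names below are hypothetical and this is not how `get_pattern_weights` is necessarily implemented. The sketch assumes each pattern receives 1/26 of the samples and that this share is split evenly among the pattern's generators.

```python
# Generator counts copied from the table above (two of the 26 patterns shown).
N_PATTERNS = 26
FORMULA_SUBSTITUTION = 1188
SEARCH_BACKTRACKING = 1


def pattern_share(n_patterns: int = N_PATTERNS) -> float:
    """Fraction of training samples assigned to each reasoning pattern."""
    return 1.0 / n_patterns


def generator_weight(generators_in_pattern: int, n_patterns: int = N_PATTERNS) -> float:
    """Sampling weight of one generator, assuming an even split within its pattern."""
    return pattern_share(n_patterns) / generators_in_pattern


print(f"share per pattern: {pattern_share():.1%}")
# share per pattern: 3.8%
ratio = generator_weight(SEARCH_BACKTRACKING) / generator_weight(FORMULA_SUBSTITUTION)
print(f"weight ratio: {ratio:.0f}x")
# weight ratio: 1188x
```

Under these assumptions the single search-and-backtracking generator is sampled 1,188 times as often as any one formula-substitution generator, which is what keeps the two patterns level.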
Why Memorisation is Impossible
The entire curriculum is 1.85 MB of algorithms. It produces terabytes of unique instances. That's a compression ratio of 1,250,000:1.
| Difficulty range | Unique problems | For scale... |
|---|---|---|
| d=1 only | ~10^12 | More than all Google searches ever |
| d=1-4 | ~10^41 | Grains of sand on Earth, squared |
| d=1-8 (full) | ~10^81 | Atoms in the observable universe |
Even the largest models can't put a dent in it:
| Model | Parameters | Samples it can memorise | Coverage of 10^81 |
|---|---|---|---|
| GPT-2 | 124,000,000 | ~134,000 | 10^-76 |
| Llama-2 7B | 7,000,000,000 | ~7.5M | 10^-74 |
| Llama-2 70B | 70,000,000,000 | ~75M | 10^-73 |
| Llama-3.1 405B | 405,000,000,000 | ~438M | 10^-72 |
| GPT-4 (est. ~1.8T) | 1,800,000,000,000 | ~1.9B | 10^-72 |
GPT-4, estimated at 1.8 trillion parameters, could memorise roughly 2 billion samples. The dataset has 10^81. The gap is 72 orders of magnitude. The only winning strategy is to learn the algorithms.
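The coverage column can be reproduced from the memorisation estimates. The sketch below takes the "can memorise" figures from the table as given (it does not derive them) and rounds each ratio to the nearest power of ten. The helper name `coverage_exponent` is made up for this example.

```python
import math

TOTAL_PROBLEMS_LOG10 = 81  # the README's estimate: ~10^81 unique problems

# Memorisation estimates copied from the table above.
MEMORISABLE = {
    "GPT-2": 134_000,
    "Llama-2 7B": 7_500_000,
    "Llama-2 70B": 75_000_000,
    "Llama-3.1 405B": 438_000_000,
    "GPT-4 (est.)": 1_900_000_000,
}


def coverage_exponent(samples: int) -> int:
    """Nearest power of ten of samples / 10^81, computed in log space."""
    return round(math.log10(samples) - TOTAL_PROBLEMS_LOG10)


for model, samples in MEMORISABLE.items():
    print(f"{model}: 10^{coverage_exponent(samples)}")
```

Working in log space avoids dividing by a number as large as 10^81 directly and makes the rounding to a whole exponent explicit.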
And here's the kicker: the algorithmic information (1.85 MB) fits inside even a 1M parameter model with 14x headroom. Models can store every algorithm. They cannot store even a billionth of the instances.
Tokenizer
All mathematical notation is written in LaTeX. The model learns to read and write LaTeX as a native language -- fractions, integrals, matrices, Greek letters (spelled out), superscripts, subscripts, and nested expressions. This means a model trained on Engram Generator doesn't just learn to solve maths -- it learns the standard notation that humans use to communicate it.
\frac{d}{dx}(-x^2-2x-2) <step> -1*2x=-2x <step> -2*1=-2 <step> 0 <step> -2x-2
\begin{pmatrix} -5 & 3 \\ 2 & 2 \end{pmatrix} \times \begin{pmatrix} -1 & -2 \\ -3 & 8 \end{pmatrix}
\oint_{|z|=3} \frac{1}{z^{2}+z-6} dz <step> poles: z=-3, 2 <step> Res(f,2)=0.2 <step> 1.2566i
Engram Generator uses a character-level tokenizer -- every character maps to exactly one token. No subword merging. No BPE. No SentencePiece.
Why? Subword tokenizers destroy the structure that reasoning depends on:
- Digit atomicity: BPE merges `"123"` into a single token. The model can't see that the `3` is in the ones place and the `1` is in the hundreds place. Arithmetic becomes impossible. Character-level tokenization keeps every digit separate, so carry operations and place-value reasoning work naturally.
- LaTeX preservation: LaTeX uses nested braces, superscripts, and subscripts (`\frac{d}{dx}`, `x^{2}`). Subword tokenizers split these unpredictably -- `\frac` might become `\fr` + `ac`, breaking the command boundary. Character-level tokenization preserves brace matching, command names, and operator structure exactly as written.
- Deterministic alignment: Every character is exactly one token. No ambiguity about tokenization boundaries. The model's attention patterns can align precisely with the mathematical structure of the problem.
The character set (132 characters + 3 special tokens = 135 vocab):
| Category | Characters |
|---|---|
| Digits (10) | 0 1 2 3 4 5 6 7 8 9 |
| Lowercase (26) | a b c ... z |
| Uppercase (26) | A B C ... Z |
| Greek (12) | α β γ δ ε θ λ μ π σ φ ω |
| Arithmetic (5) | + - * / ^ |
| Relations (4) | ≤ ≥ ≠ ≈ |
| Grouping (6) | ( ) [ ] { } |
| Calculus & analysis (4) | ∂ ∫ √ ∞ |
| Set theory (5) | ∈ ⊂ ∅ ∩ ∪ |
| Logic (9) | ∀ ∃ ¬ ∧ ∨ ⊢ ⊨ ↔ ⊥ |
| Punctuation (9) | = : ; ? . , ! ' " |
| LaTeX & structure (7) | \ _ \| ~ < > % |
| Other (9) | # @ $ & ° × — → (space) |
| Special tokens (3) | <pad> <eos> <step> |
The <step> token separates solution steps in the target sequence. All generator output is constrained to use only characters in this set -- any generator that produces a character outside it is a bug and is caught by the test suite.
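A character-level tokenizer over this vocabulary could look like the sketch below. The character set is transcribed from the table above. The ID assignment (special tokens first, then characters in table order) and the function names `encode` and `decode` are assumptions for illustration, not the package's actual tokenizer.

```python
import string
from typing import List

# Character set transcribed from the table above, row by row.
CHARSET = (
    string.digits + string.ascii_lowercase + string.ascii_uppercase
    + "αβγδεθλμπσφω"      # Greek (12)
    + "+-*/^"             # Arithmetic (5)
    + "≤≥≠≈"              # Relations (4)
    + "()[]{}"            # Grouping (6)
    + "∂∫√∞"              # Calculus & analysis (4)
    + "∈⊂∅∩∪"             # Set theory (5)
    + "∀∃¬∧∨⊢⊨↔⊥"         # Logic (9)
    + "=:;?.,!'\""        # Punctuation (9)
    + "\\_|~<>%"          # LaTeX & structure (7)
    + "#@$&°×\u2014→ "    # Other (9); \u2014 is the long dash from that row
)
SPECIALS = ["<pad>", "<eos>", "<step>"]

# Assumed ID layout: special tokens first, then characters in table order.
TOKEN_TO_ID = {tok: i for i, tok in enumerate(SPECIALS + list(CHARSET))}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}


def encode(text: str) -> List[int]:
    """Map text to IDs: one ID per character, one ID per special token."""
    ids = []
    i = 0
    while i < len(text):
        # Match special tokens first so "<step>" is not split into six characters.
        special = next((s for s in SPECIALS if text.startswith(s, i)), None)
        token = special if special else text[i]
        if token not in TOKEN_TO_ID:
            raise ValueError(f"character {token!r} is outside the vocabulary")
        ids.append(TOKEN_TO_ID[token])
        i += len(token)
    return ids


def decode(ids: List[int]) -> str:
    """Inverse of encode."""
    return "".join(ID_TO_TOKEN[i] for i in ids)
```

Raising on an unknown character mirrors the rule stated above that any out-of-vocabulary output is a bug, and the round trip `decode(encode(text)) == text` holds for any text inside the character set.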
Samples
Input: add two 5 digit numbers
Target: 13278 + 46048 <step> 8+8=16 <step> 7+4+1=12 <step> 2+0+1=3 <step> 3+6=9 <step> 1+4=5 <step> 59326
- Input: natural language task description
- Target: problem, solution steps, and answer separated by `<step>` tokens
- Both capped at 512 characters
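To make "correct by construction" concrete, here is a minimal sketch of a seeded generator that emits a target in the format of the sample above. The function names `addition_trace` and `make_sample` are hypothetical and the package's own addition generator may differ, in particular in how it shows a carry out of the leading digit.

```python
import random
from typing import Tuple


def addition_trace(a: int, b: int) -> str:
    """Column-by-column addition, right to left, joined by <step> tokens."""
    digits_a, digits_b = str(a)[::-1], str(b)[::-1]
    steps = [f"{a} + {b}"]
    carry = 0
    for i in range(max(len(digits_a), len(digits_b))):
        x = int(digits_a[i]) if i < len(digits_a) else 0
        y = int(digits_b[i]) if i < len(digits_b) else 0
        total = x + y + carry
        # The carry from the previous column is written as an explicit "+1".
        lhs = f"{x}+{y}" + ("+1" if carry else "")
        steps.append(f"{lhs}={total}")
        carry = total // 10
    # The answer comes from the same values that defined the problem.
    steps.append(str(a + b))
    return " <step> ".join(steps)


def make_sample(seed: int, digits: int = 5) -> Tuple[str, str]:
    """Return (input_text, target_text); the same seed always gives the same sample."""
    rng = random.Random(seed)
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(low, high), rng.randint(low, high)
    return f"add two {digits} digit numbers", addition_trace(a, b)


print(addition_trace(13278, 46048))
# 13278 + 46048 <step> 8+8=16 <step> 7+4+1=12 <step> 2+0+1=3 <step> 3+6=9 <step> 1+4=5 <step> 59326
```

Because the operands come from a seeded random number generator and the answer is computed from those same operands, no stored answer key is needed and every sample can be regenerated on demand.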
Usage
Generate samples
from engram_generator.curriculum.registry import get_generator
gen = get_generator("addition", min_difficulty=3, max_difficulty=5)
samples = gen.generate(100)
for sample in samples[:3]:
    print(f"Input: {sample.input_text}")
    print(f"Target: {sample.target_text}")
    print(f"Answer: {sample.answer}")
Use the skill tree
from engram_generator.curriculum.registry import get_all_generators
from engram_generator.curriculum.skill_tree import SkillTree
generators = get_all_generators()
tree = SkillTree(generators, retention_ratio=0.1)
# See what's unlocked
print(tree.get_unlocked_tasks())
# Level up by proving mastery
events = tree.update({"addition": 0.97, "subtraction": 0.85})
Balanced training
from engram_generator.curriculum.reasoning_patterns import (
    get_pattern_weights,
    get_pattern_summary,
)
from engram_generator.curriculum.registry import get_all_generators
gens = get_all_generators()
weights = get_pattern_weights(gens)
# Each of the 26 reasoning patterns gets equal training exposure
summary = get_pattern_summary(gens)
for pattern, count in sorted(summary.items(), key=lambda x: -x[1])[:5]:
    print(f"{pattern}: {count} generators -> 3.8% of training")
Validate
engram-validate --all --samples 20
engram-validate --skill-tree
engram-validate --task addition --difficulty 5 --samples 100
Testing
python -m pytest tests/ -v
6,326 tests across 16 test modules:
- Sanity (6,066): every generator is checked at low difficulty, at high difficulty, and for determinism
- Correctness (75): independent mathematical verification
- Structural (185): no orphans, no dangling prerequisites, no backwards cross-tier deps
- Coverage: 99% (77,452 statements, 1,104 missed)
Roadmap
Current: v0.1.0 -- 2,022 generators, 100+ domains, 26 reasoning patterns
Planned:
- Code generation -- generators that output executable code (Python, pseudocode), verified by sandboxed execution
- Tool calling -- generators that produce structured tool-call sequences from task descriptions
- Agentic reasoning -- multi-step observation-action-reward chains for planning and tool use
- 5,000+ generators -- deeper coverage of existing domains, plus medicine, law, philosophy, and linguistics
- Multi-language output -- same algorithms, different natural language task descriptions
- Difficulty auto-scaling -- dynamic difficulty adjustment based on model accuracy curves
License
MIT