Procedural synthetic dataset generator for training reasoning AI — 2,022 generators across 100+ scientific domains

  _____ _   _  ____ ____      _    __  __
 | ____| \ | |/ ___|  _ \    / \  |  \/  |
 |  _| |  \| | |  _| |_) |  / _ \ | |\/| |
 | |___| |\  | |_| |  _ <  / ___ \| |  | |
 |_____|_| \_|\____|_| \_\/_/   \_\_|  |_|
          G E N E R A T O R

2,022 generators. 100+ scientific domains. 10^81 unique problems.

A procedural dataset that encodes the breadth of human scientific knowledge -- mathematics, physics, chemistry, biology, computer science, engineering, quantum theory, earth sciences, economics, logic, and more -- as step-by-step reasoning problems. The goal is not to build a benchmark. The goal is to teach machines how humans reason, discover, and invent.

Every problem is generated on-the-fly. Every answer is correct by construction. There is no dataset file -- just code that writes an endless exam across every discipline humans have formalised.

This repository contains AI-generated code, reviewed and directed by a human author.

pip install engram-generator

The Problem

Models trained on static datasets learn to pattern-match, not to reason. Train a model on 10,000 addition problems and it learns a lookup table, not addition. Change the digit count and it breaks. That's memorisation pretending to be intelligence.

Human reasoning didn't develop by memorising answers. It developed by solving problems across domains -- by recognising that the same recursive structure appears in Fibonacci sequences, merge sort, and mathematical induction. That a conservation law works the same way in thermodynamics, circuit analysis, and chemical equilibria. That a proof by contradiction in logic uses the same mental move as a reducibility argument in computability theory.

Engram Generator encodes this cross-domain structure:

  • 10^81 unique problems -- more than atoms in the observable universe
  • Step-by-step solutions -- show your work or fail
  • 26 reasoning strategies with balanced exposure -- no single trick works
  • 100+ scientific domains -- the breadth of formalised human knowledge
  • Adaptive difficulty -- the curriculum escalates as the model improves
  • Must Level Up -- advanced tasks are locked behind mastery of prerequisites
  • Provably correct -- every answer is generated by the same algorithm that generated the problem

The Arc

A model trained on this curriculum climbs from counting to self-awareness:

Tier 0   "2 + 3 = 5"
  |
Tier 2   "d/dx(3x^2 + 2x) = 6x + 2"
  |
Tier 5   "curl F = (dFz/dy - dFy/dz, ...)"
  |
Tier 7   "This proof has an error in step 3. Here is the correction."
  |
Tier 8   "These two problems share an isomorphic structure."
  |
Tier 9   "To solve this class of problems, I would design the following algorithm."
  |
Tier 10  "My architecture struggles with length generalisation.
          Here is a proposed modification."

From following procedures to creating them. From solving problems to understanding what makes problems solvable.

What's Inside

| Domain | Generators | Highlights |
|---|---|---|
| Mathematics | 730+ | Arithmetic through category theory, PDEs, algebraic geometry, measure theory, homological algebra |
| Physics | 200+ | Classical mechanics to quantum field theory, plasma physics, particle physics, general relativity |
| Computer Science | 230+ | Algorithms, cryptography, compilers, distributed systems, ML theory, formal verification |
| Chemistry | 80+ | General, organic, physical, spectroscopy, polymer science, electrochemistry |
| Biology & Health | 90+ | Genetics, biochemistry, epidemiology, neuroscience, systems biology, pharmacology |
| Engineering | 100+ | Signal processing, control theory, semiconductors, photonics, aerospace, structural |
| Quantum | 50+ | Formalism, information theory, field theory, error correction |
| Earth & Space | 30+ | Astronomy, geology, oceanography, geophysics, climate science |
| Social & Cognitive | 50+ | Economics, game theory, linguistics, causal inference, cognitive science |
| Logic & Foundations | 50+ | Formal logic, model theory, computability, proof theory, set theory |
| Other | 100+ | Music theory, financial maths, medical imaging, persistent homology, wavelet theory |

Levelling System

You don't get to attempt RSA encryption until you can do modular arithmetic. You don't get to critique a proof until you can write one. The skill tree enforces this:

| Tier | Tasks | What You Unlock | Examples |
|---|---|---|---|
| 0 | 20 | Fundamentals | Addition, subtraction, sorting, boolean logic |
| 1 | 36 | Building blocks | Multiplication, Fibonacci, Caesar cipher |
| 2 | 47 | Algebra & graphs | Derivatives, quadratics, graph reachability |
| 3 | 95 | Real maths | Integrals, determinants, boolean algebra |
| 4 | 313 | Applied science | Physics, probability, dynamic programming |
| 5 | 730 | Expert territory | PDEs, cryptography, quantum mechanics |
| 6 | 521 | Graduate level | Topology, general relativity, information theory |
| 7 | 176 | Meta-reasoning | Proof strategy, error detection, generalisation |
| 8 | 31 | Creative | Conjecture, isomorphism detection |
| 9 | 29 | Research | Algorithm design, impossibility proofs |
| 10 | 24 | Self-architecture | Scaling laws, architecture search, loss design |
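
The gating rule can be sketched in a few lines. This is a minimal illustration and not the package's own SkillTree: the `Task` class, the `unlocked` function, the tier numbers given to each task, and the 0.9 mastery threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    tier: int
    prerequisites: tuple = ()


# Hypothetical miniature tree. Task names follow the examples in the text;
# the tier numbers are illustrative only.
TASKS = [
    Task("addition", 0),
    Task("multiplication", 1, ("addition",)),
    Task("modular_arithmetic", 3, ("multiplication",)),
    Task("rsa_encryption", 5, ("modular_arithmetic",)),
]


def unlocked(tasks, accuracy, threshold=0.9):
    """Return names of tasks whose prerequisites all meet the threshold.

    `accuracy` maps task name -> measured accuracy in [0, 1].
    The threshold value is an assumption, not a documented default.
    """
    return [
        t.name
        for t in tasks
        if all(accuracy.get(p, 0.0) >= threshold for p in t.prerequisites)
    ]


# Mastering addition opens multiplication, but nothing further up the tree.
print(unlocked(TASKS, {"addition": 0.97}))
```

A task with no prerequisites is always available, which is how the tier 0 fundamentals bootstrap the rest of the tree in this sketch.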

Reasoning Balance

Without balancing, formula-substitution problems (1,188 of 2,022 generators, about 59%) would dominate training. The model would learn to plug numbers into equations and call it a day.

Instead, training is balanced across 26 reasoning strategies. Each gets equal exposure regardless of how many generators belong to it:

| Pattern | Generators | What it teaches |
|---|---|---|
| Formula substitution | 1,188 | Plug values into known equations |
| Meta-reasoning | 112 | Proof strategy, error analysis, architecture design |
| Probabilistic reasoning | 87 | Bayes, distributions, expected values, stochastic processes |
| Differential equations | 76 | ODEs, PDEs, boundary value problems, numerical methods |
| Graph traversal | 73 | BFS, DFS, Dijkstra, flow networks, connectivity |
| Simulation trace | 60 | State machines, data structures, protocol execution |
| Symbolic manipulation | 49 | Differentiation, integration, algebraic simplification |
| Construction & verification | 39 | Group axioms, homomorphisms, topological invariants |
| Geometric computation | 34 | Areas, volumes, intersections, convex hulls |
| Conservation & balance | 33 | Thermodynamic laws, Kirchhoff, chemical equilibria |
| Linear algebra | 31 | Matrix decomposition, eigenvalues, null spaces |
| Counting & enumeration | 28 | Permutations, Catalan numbers, inclusion-exclusion |
| Statistical inference | 26 | Hypothesis testing, confidence intervals, regression |
| Logical deduction | 26 | Natural deduction, resolution, sequent calculus |
| Modular arithmetic | 22 | CRT, Euler's totient, discrete logarithms |
| Transform methods | 20 | Fourier, Laplace, Z-transform, wavelets |
| Dynamic programming | 19 | Optimal substructure, memoisation, alignment |
| Optimization | 19 | Gradient descent, KKT conditions, convex methods |
| Series & convergence | 19 | Ratio test, power series, uniform convergence |
| Encoding & decoding | 15 | RSA, Huffman, Reed-Solomon, stream ciphers |
| Approximation & numerical | 14 | Newton-Raphson, quadrature, interpolation |
| Recursive decomposition | 11 | Divide-and-conquer, Tower of Hanoi, merge sort |
| Comparison & ordering | 9 | Periodic trends, mineral identification, ranking |
| Dimensional analysis | 8 | Unit conversion, significant figures, calibration |
| Greedy selection | 3 | Interval scheduling, bin packing, set cover |
| Search & backtracking | 1 | A*, constraint satisfaction |

Formula substitution has 1,188 generators. Search & backtracking has 1. But during training, both patterns get 3.8% of samples. No free rides.
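
One way to get equal exposure is to weight each generator by 1 / (number of patterns x generators in its pattern). The sketch below works through that arithmetic with the counts from the table. It is an illustration of the idea, not the implementation behind `get_pattern_weights`, and `per_generator_weights` is a hypothetical helper.

```python
# Generator counts per reasoning pattern, copied from the table above.
PATTERN_COUNTS = {
    "Formula substitution": 1188, "Meta-reasoning": 112,
    "Probabilistic reasoning": 87, "Differential equations": 76,
    "Graph traversal": 73, "Simulation trace": 60,
    "Symbolic manipulation": 49, "Construction & verification": 39,
    "Geometric computation": 34, "Conservation & balance": 33,
    "Linear algebra": 31, "Counting & enumeration": 28,
    "Statistical inference": 26, "Logical deduction": 26,
    "Modular arithmetic": 22, "Transform methods": 20,
    "Dynamic programming": 19, "Optimization": 19,
    "Series & convergence": 19, "Encoding & decoding": 15,
    "Approximation & numerical": 14, "Recursive decomposition": 11,
    "Comparison & ordering": 9, "Dimensional analysis": 8,
    "Greedy selection": 3, "Search & backtracking": 1,
}


def per_generator_weights(counts):
    """Sampling probability for ONE generator of each pattern.

    Every pattern receives total mass 1 / len(counts), split evenly
    among the generators that belong to it.
    """
    n_patterns = len(counts)
    return {p: 1.0 / (n_patterns * n) for p, n in counts.items()}


weights = per_generator_weights(PATTERN_COUNTS)
for pattern in ("Formula substitution", "Search & backtracking"):
    share = PATTERN_COUNTS[pattern] * weights[pattern]
    print(f"{pattern}: per-generator weight {weights[pattern]:.6f}, "
          f"pattern share {share:.1%}")
```

Under this scheme the single search & backtracking generator is sampled 1,188 times as often as any one formula-substitution generator, which is what keeps the two pattern shares equal.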

Why Memorisation is Impossible

The entire curriculum is 1.85 MB of algorithms. It produces terabytes of unique instances. That's a compression ratio of 1,250,000:1.

| Difficulty range | Unique problems | For scale... |
|---|---|---|
| d=1 only | ~10^12 | More than all Google searches ever |
| d=1-4 | ~10^41 | Grains of sand on Earth, squared |
| d=1-8 (full) | ~10^81 | Atoms in the observable universe |

Even the largest models can't put a dent in it:

| Model | Parameters | Can memorise | Coverage of 10^81 |
|---|---|---|---|
| GPT-2 | 124,000,000 | ~134,000 | 10^-76 |
| Llama-2 7B | 7,000,000,000 | ~7.5M | 10^-74 |
| Llama-2 70B | 70,000,000,000 | ~75M | 10^-73 |
| Llama-3.1 405B | 405,000,000,000 | ~438M | 10^-72 |
| GPT-4 (est. ~1.8T) | 1,800,000,000,000 | ~1.9B | 10^-72 |

GPT-4, estimated at 1.8 trillion parameters, could memorise roughly 2 billion samples. The dataset has 10^81. The gap is 72 orders of magnitude. The only winning strategy is to learn the algorithms.
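
The coverage column is just a ratio of powers of ten, and it can be checked directly. The sketch below takes the memorisable-sample estimates from the table as given (it does not derive them) and rounds each ratio to the nearest power of ten; `coverage_exponent` is a hypothetical helper name.

```python
import math

TOTAL_EXPONENT = 81  # the document's estimate: about 10^81 unique problems

# Memorisable-sample estimates copied from the table above.
MEMORISABLE = {
    "GPT-2": 134_000,
    "Llama-2 7B": 7_500_000,
    "Llama-2 70B": 75_000_000,
    "Llama-3.1 405B": 438_000_000,
    "GPT-4 (est.)": 1_900_000_000,
}


def coverage_exponent(samples: int) -> int:
    """Nearest power of ten of samples / 10^81."""
    return round(math.log10(samples) - TOTAL_EXPONENT)


for model, samples in MEMORISABLE.items():
    print(f"{model}: coverage about 10^{coverage_exponent(samples)}")

# Gap between what the largest model could memorise and the full space.
gap = TOTAL_EXPONENT - math.log10(MEMORISABLE["GPT-4 (est.)"])
print(f"GPT-4 gap: about {gap:.0f} orders of magnitude")
```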

And here's the kicker: the algorithmic information (1.85 MB) fits inside even a 1M parameter model with 14x headroom. Models can store every algorithm. They cannot store even a billionth of the instances.

Tokenizer

All mathematical notation is written in LaTeX. The model learns to read and write LaTeX as a native language -- fractions, integrals, matrices, Greek letters (spelled out), superscripts, subscripts, and nested expressions. This means a model trained on Engram Generator doesn't just learn to solve maths -- it learns the standard notation that humans use to communicate it.

\frac{d}{dx}(-x^2-2x-2) <step> -1*2x=-2x <step> -2*1=-2 <step> 0 <step> -2x-2

\begin{pmatrix} -5 & 3 \\ 2 & 2 \end{pmatrix} \times \begin{pmatrix} -1 & -2 \\ -3 & 8 \end{pmatrix}

\oint_{|z|=3} \frac{1}{z^{2}+z-6} dz <step> poles: z=-3, 2 <step> Res(f,2)=0.2 <step> 1.2566i

Engram Generator uses a character-level tokenizer -- every character maps to exactly one token. No subword merging. No BPE. No SentencePiece.

Why? Subword tokenizers destroy the structure that reasoning depends on:

  • Digit atomicity: BPE can merge "123" into a single token. The model then cannot see that the 3 is in the ones place and the 1 is in the hundreds place, which makes arithmetic much harder to learn. Character-level tokenization keeps every digit separate, so carry operations and place-value reasoning work naturally.
  • LaTeX preservation: LaTeX uses nested braces, superscripts, and subscripts (\frac{d}{dx}, x^{2}). Subword tokenizers split these unpredictably -- \frac might become \fr + ac, breaking the command boundary. Character-level tokenization preserves brace matching, command names, and operator structure exactly as written.
  • Deterministic alignment: Every character is exactly one token. No ambiguity about tokenization boundaries. The model's attention patterns can align precisely with the mathematical structure of the problem.

The character set (132 characters + 3 special tokens = 135 vocab):

Category Characters
Digits (10) 0 1 2 3 4 5 6 7 8 9
Lowercase (26) a b c ... z
Uppercase (26) A B C ... Z
Greek (12) α β γ δ ε θ λ μ π σ φ ω
Arithmetic (5) + - * / ^
Relations (4) ≤ ≥ ≠ ≈
Grouping (6) ( ) [ ] { }
Calculus & analysis (4) ∂ ∫ √ ∞
Set theory (5) ∈ ⊂ ∅ ∩ ∪
Logic (9) ∀ ∃ ¬ ∧ ∨ ⊢ ⊨ ↔ ⊥
Punctuation (9) = : ; ? . , ! ' "
LaTeX & structure (7) \ _ | ~ < > %
Other (9) # @ $ & ° × — → (space)
Special tokens (3) <pad> <eos> <step>

The <step> token separates solution steps in the target sequence. All generator output is constrained to use only characters in this set. A generator that produces a character outside it has a bug, and the test suite catches it.
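
A character-level tokenizer over this set is small enough to sketch in full. The version below rebuilds the 132 characters from the table and treats the three special tokens as single units. The ID assignment (special tokens first) and the names `CHARSET`, `VOCAB`, and `encode` are assumptions for illustration, not the package's actual tokenizer.

```python
import string

SPECIAL = ["<pad>", "<eos>", "<step>"]

# The 132 characters from the table above, row by row.
CHARSET = (
    string.digits + string.ascii_lowercase + string.ascii_uppercase
    + "αβγδεθλμπσφω"        # Greek (12)
    + "+-*/^"               # Arithmetic (5)
    + "≤≥≠≈"                # Relations (4)
    + "()[]{}"              # Grouping (6)
    + "∂∫√∞"                # Calculus & analysis (4)
    + "∈⊂∅∩∪"               # Set theory (5)
    + "∀∃¬∧∨⊢⊨↔⊥"           # Logic (9)
    + "=:;?.,!'\""          # Punctuation (9)
    + "\\_|~<>%"            # LaTeX & structure (7)
    + "#@$&°×\u2014→ "      # Other (9); \u2014 is the long dash in the table
)

# Assumed ID layout: special tokens first, then characters in table order.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + list(CHARSET))}


def encode(text: str) -> list:
    """Map text to token IDs: one ID per character or special token."""
    ids = []
    i = 0
    while i < len(text):
        for tok in SPECIAL:
            if text.startswith(tok, i):
                ids.append(VOCAB[tok])
                i += len(tok)
                break
        else:  # no special token starts here, so consume one character
            ch = text[i]
            if ch not in VOCAB:
                raise ValueError(f"character {ch!r} is outside the vocabulary")
            ids.append(VOCAB[ch])
            i += 1
    return ids


print(len(VOCAB))
print(len(encode("8+8=16 <step> 59326")))
```

Raising on an out-of-set character mirrors the rule that such output is treated as a generator bug rather than silently mapped to an unknown token.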

Samples

Input:  add two 5 digit numbers
Target: 13278 + 46048 <step> 8+8=16 <step> 7+4+1=12 <step> 2+0+1=3 <step> 3+6=9 <step> 1+4=5 <step> 59326
  • Input: natural language task description
  • Target: problem, solution steps, and answer separated by <step> tokens
  • Both capped at 512 characters
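
To show what "correct by construction" means for a sample like this one, here is a toy addition generator that builds the step trace and the answer from the same column-by-column pass. The trace format is inferred from the single sample above, and `addition_trace` and `make_sample` are hypothetical names, not part of the package.

```python
import random

STEP = " <step> "


def addition_trace(a: int, b: int) -> str:
    """Build 'a + b <step> column sums... <step> answer' right to left."""
    xs, ys = str(a)[::-1], str(b)[::-1]
    width = max(len(xs), len(ys))
    xs, ys = xs.ljust(width, "0"), ys.ljust(width, "0")  # pad shorter number

    parts = [f"{a} + {b}"]
    carry = 0
    for dx, dy in zip(xs, ys):
        total = int(dx) + int(dy) + carry
        # The carry is only written out when it is non-zero, as in the sample.
        terms = f"{dx}+{dy}" + (f"+{carry}" if carry else "")
        parts.append(f"{terms}={total}")
        carry = total // 10
    parts.append(str(a + b))
    return STEP.join(parts)


def make_sample(digits: int, rng: random.Random):
    """Return (input_text, target_text) for a random addition problem."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"add two {digits} digit numbers", addition_trace(a, b)


# Reproduces the target shown in the sample above.
print(addition_trace(13278, 46048))
```

Because operands are drawn at random and the trace is computed rather than stored, each call to `make_sample` yields a fresh problem whose steps and answer agree.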

Usage

Generate samples

from engram_generator.curriculum.registry import get_generator

gen = get_generator("addition", min_difficulty=3, max_difficulty=5)
samples = gen.generate(100)

for sample in samples[:3]:
    print(f"Input:  {sample.input_text}")
    print(f"Target: {sample.target_text}")
    print(f"Answer: {sample.answer}")

Use the skill tree

from engram_generator.curriculum.registry import get_all_generators
from engram_generator.curriculum.skill_tree import SkillTree

generators = get_all_generators()
tree = SkillTree(generators, retention_ratio=0.1)

# See what's unlocked
print(tree.get_unlocked_tasks())

# Level up by proving mastery
events = tree.update({"addition": 0.97, "subtraction": 0.85})

Balanced training

from engram_generator.curriculum.reasoning_patterns import (
    get_pattern_weights, get_pattern_summary,
)
from engram_generator.curriculum.registry import get_all_generators

gens = get_all_generators()
weights = get_pattern_weights(gens)

# Each reasoning pattern gets an equal share of training exposure
summary = get_pattern_summary(gens)
share = 100 / len(summary)  # 26 patterns -> about 3.8% each
for pattern, count in sorted(summary.items(), key=lambda x: -x[1])[:5]:
    print(f"{pattern}: {count} generators -> {share:.1f}% of training")

Validate

engram-validate --all --samples 20
engram-validate --skill-tree
engram-validate --task addition --difficulty 5 --samples 100

Testing

python -m pytest tests/ -v

6,326 tests across 16 test modules:

  • Sanity (6,066): every generator checked at low difficulty, at high difficulty, and for determinism (3 x 2,022)
  • Correctness (75): independent mathematical verification
  • Structural (185): no orphans, no dangling prerequisites, no backwards cross-tier deps
  • Coverage: 99% (77,452 statements, 1,104 missed)
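
The structural checks amount to a few passes over the prerequisite graph. The sketch below shows one plausible reading of the three rules. The definition of "orphan" (a task above tier 0 with no prerequisites), the data layout, and the function name `structural_problems` are assumptions, not the contents of the actual test modules.

```python
def structural_problems(tasks):
    """Check a prerequisite graph.

    `tasks` maps task name -> (tier, list of prerequisite names).
    Returns a list of human-readable problem descriptions.
    """
    problems = []
    for name, (tier, prereqs) in tasks.items():
        # Assumed meaning of "orphan": nothing leads into a non-basic task.
        if tier > 0 and not prereqs:
            problems.append(f"orphan: {name}")
        for p in prereqs:
            if p not in tasks:
                problems.append(f"dangling: {name} -> {p}")
            elif tasks[p][0] > tier:
                # A prerequisite must not sit in a higher tier than its dependant.
                problems.append(f"backwards: {name} -> {p}")
    return problems


# Hypothetical trees for illustration.
good = {
    "addition": (0, []),
    "multiplication": (1, ["addition"]),
}
bad = {
    "addition": (0, ["calculus"]),        # prerequisite in a higher tier
    "calculus": (2, ["limits"]),          # prerequisite does not exist
    "topology": (6, []),                  # advanced task with no way in
}

print(structural_problems(good))
print(structural_problems(bad))
```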

Roadmap

Current: v0.1.0 -- 2,022 generators, 100+ domains, 26 reasoning patterns

Planned:

  • Code generation -- generators that output executable code (Python, pseudocode), verified by sandboxed execution
  • Tool calling -- generators that produce structured tool-call sequences from task descriptions
  • Agentic reasoning -- multi-step observation-action-reward chains for planning and tool use
  • 5,000+ generators -- deeper coverage of existing domains, plus medicine, law, philosophy, and linguistics
  • Multi-language output -- same algorithms, different natural language task descriptions
  • Difficulty auto-scaling -- dynamic difficulty adjustment based on model accuracy curves

License

MIT

Organisation

www.engram.one · www.deepnet.one
