Symbolic regression by genetic programming (C++ engine, Python bindings)

eqhunt

Symbolic regression by genetic programming. C++ engine, Python bindings via nanobind.

Give it a table of (inputs, target) pairs; it returns a human-readable formula that approximates the relationship. No neural network, no black box — just an algebraic expression you can read, paste into a calculator, or hand-tune.

import eqhunt

X = [[1, 1], [2, 3], [4, 5], [7, 2], [9, 9]]
y = [2, 5, 9, 9, 18]

model = eqhunt.fit(X, y)
print(model.formula)         # e.g.  f(x,y) = (x+y)
print(model.error)           # e.g.  0.0
print(model.predict([6, 7])) # -> 13.0

Install

pip install eqhunt

Prebuilt wheels are published for Linux, macOS and Windows on common Python versions. If pip falls back to building from source, you'll need a C++17 compiler and CMake 3.15+.

Two ways to use it

Ultra-simple

import eqhunt

model = eqhunt.fit(X, y, generations=5000)
print(model.formula)
model.predict([1, 2])         # single row
model.predict([[1, 2], [3, 4]])  # batch

fit() accepts any Config field as a keyword argument:

eqhunt.fit(X, y, pop=800, trig_penalty=2.0, bloat_penalty=0.3)

Fully configurable

import eqhunt

cfg = eqhunt.Config()
cfg.pop               = 800
cfg.gen               = 50000
cfg.tournament_size   = 5
cfg.initial_depth     = 5
cfg.bloat_penalty     = 0.3
cfg.trig_penalty      = 1.5
cfg.accepted_error    = 0.01

# Re-weight individual operators (higher = more likely to appear)
cfg.op_weights.sin = 1.0      # boost sine
cfg.op_weights.cos = 1.0
cfg.op_weights.exp = 0.0      # disable exp entirely
cfg.pi_prob = 0.10            # 'pi' more frequent in terminals

model = eqhunt.Model(cfg).fit(X, y)
print(model.formula)

You can also train from a CSV file (one row per sample, last column = target, lines starting with # are comments):

eqhunt.Model().fit_csv("nivel_embase.csv")
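To make the expected file layout concrete, here is a minimal sketch that reads the same format with the standard library only. It is not eqhunt's own parser: `read_samples` is a hypothetical helper, and the comma delimiter is an assumption.

```python
import csv
import io

def read_samples(text):
    """Split CSV text into (X, y): last column is the target, '#' lines are comments."""
    X, y = [], []
    # Drop blank lines and comment lines before handing the rest to csv.reader.
    lines = (ln for ln in io.StringIO(text)
             if ln.strip() and not ln.lstrip().startswith("#"))
    for row in csv.reader(lines):
        values = [float(v) for v in row]
        X.append(values[:-1])   # every column except the last is an input
        y.append(values[-1])    # last column is the target
    return X, y

sample = """# x, y, target
1,1,2
2,3,5
4,5,9
"""
X, y = read_samples(sample)
print(X)  # [[1.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
print(y)  # [2.0, 5.0, 9.0]
```

The resulting lists have the same shape as the X and y passed to fit() in the earlier examples.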

Operators available

| Category | Operators |
| --- | --- |
| Arithmetic | + - * / -x |
| Powers | sqrt ** |
| Conditional | if(cond, then, else) (cond > 0) |
| Trig | sin cos tan |
| Exp / log | exp log |
| Constants | numeric literals, pi |

Trigonometric, log and exp nodes have low default weights so they only appear after enough mutation pressure — useful for cyclic / physical data, ignored otherwise. Adjust via Config.op_weights.
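The effect of operator weights can be illustrated with a weighted random draw. This is a sketch of the general idea using Python's `random.choices`, not the engine's actual sampling code, and the weights below are made up for illustration rather than eqhunt's defaults.

```python
import random

# Hypothetical weights, for illustration only; not eqhunt's defaults.
weights = {"+": 1.0, "*": 1.0, "sin": 0.2, "exp": 0.0}

rng = random.Random(0)          # fixed seed so the run is repeatable
ops = list(weights)
draws = rng.choices(ops, weights=[weights[o] for o in ops], k=1000)

counts = {op: draws.count(op) for op in ops}
print(counts)  # 'exp' never appears because its weight is 0
```

A weight of 0 removes an operator from the draw entirely, while a low but non-zero weight keeps it rare without excluding it.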

How error and validity are handled

  • Per-sample error is |prediction - target|; total error is the sum.
  • Invalid evaluations (/0, sqrt(<0), log(<=0), exp(huge)) get a soft per-sample penalty rather than killing the whole formula, so a single out-of-domain sample does not disqualify an otherwise good candidate. If more than 25% of samples fail, the formula is rejected.
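The two rules above can be sketched in plain Python. This is an illustration under assumptions, not the engine's code: the penalty value is invented (the real one is not documented here), `total_error` is a hypothetical helper, and candidate formulas are stood in for by ordinary Python callables.

```python
import math

PENALTY = 1000.0      # assumed per-sample penalty; the engine's real value is not documented here
MAX_INVALID = 0.25    # reject when more than 25% of samples fail

def total_error(formula, X, y):
    """Sum |prediction - target|, penalising invalid samples; None means rejected."""
    total, invalid = 0.0, 0
    for row, target in zip(X, y):
        try:
            pred = formula(*row)
            if not math.isfinite(pred):
                raise ValueError("non-finite result")
            total += abs(pred - target)
        except (ZeroDivisionError, ValueError, OverflowError):
            invalid += 1          # out-of-domain sample: penalise, keep going
            total += PENALTY
    if invalid > MAX_INVALID * len(X):
        return None               # too many failures: reject the candidate
    return total

X = [[1, 1], [2, 3], [4, 5], [7, 2], [9, 9]]
y = [2, 5, 9, 9, 18]
print(total_error(lambda a, b: a + b, X, y))             # 0.0
print(total_error(lambda a, b: 1 / (a - 7), X, y))       # one invalid sample: penalised, not rejected
print(total_error(lambda a, b: math.sqrt(a - b), X, y))  # None (2 of 5 samples fail)
```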

Stopping early

Config.accepted_error stops the search as soon as total error drops below the threshold. You can also call model.stop() from another thread (or a signal handler) to ask the loop to wrap up after the current generation.
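Both stopping conditions follow a common cooperative pattern: the loop checks a threshold and a shared flag once per generation. The sketch below shows that pattern with `threading.Event`. It is a toy stand-in, not eqhunt's loop: `search` is hypothetical and the list of errors is invented.

```python
import threading

stop_requested = threading.Event()   # stands in for the flag that model.stop() would set

def search(errors_by_generation, accepted_error):
    """Walk a pre-computed list of best errors; return (generation, reason)."""
    for gen, err in enumerate(errors_by_generation):
        if err < accepted_error:
            return gen, "accepted_error"
        if stop_requested.is_set():
            return gen, "stopped"    # the current generation finishes, then the loop exits
    return len(errors_by_generation) - 1, "max generations"

print(search([5.0, 2.0, 0.4, 0.1], accepted_error=0.5))  # (2, 'accepted_error')
stop_requested.set()                                     # e.g. from another thread
print(search([5.0, 2.0, 0.4, 0.1], accepted_error=0.5))  # (0, 'stopped')
```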

Saving and reloading a formula

A trained model is just a string — you can persist it, ship it, paste it, diff it. To reuse a formula in a new process without retraining, parse it back into a Model:

import eqhunt

# train and save
m = eqhunt.fit(X, y)
print(m.formula)          # e.g.  f(x,y) = ((x*x) - (y*y))
m.save("model.txt")       # writes the formula as a single line

# later, in a fresh process — no training needed
m2 = eqhunt.Model.load_file("model.txt")
m2.predict([6, 7])        # -13.0
m2.predict([[1, 2], [3, 4]])

You can also go through strings directly:

formula_str = m.formula                   # or any equivalent expression
m3 = eqhunt.Model.from_formula(formula_str)
m3.predict([12, 5])

Or mutate an existing model in place:

m.load_formula("(x*x + y*y)")             # replaces the current tree

Accepted syntax: anything the engine itself emits via get_formula() — arithmetic (+ - * / **), unary minus, sqrt sin cos tan log exp if, variables x y z w v u x6 x7 …, numeric literals (int / float / 1e5), and pi. Both the bare expression ("(x+y)") and the full prefixed form ("f(x,y) = (x+y)") are accepted; the parser strips everything up to and including the first =. Parse errors raise RuntimeError.

The number of input variables is inferred from the highest variable index in the formula, so m2.num_vars is set correctly without needing to know it in advance.
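The prefix stripping and variable-count inference described above can be approximated in a few lines. This is a sketch, not the engine's parser. It assumes variables are zero-indexed in the order listed (x=0 through u=5, then x6, x7, ...), and `strip_prefix`, `var_index` and `num_vars` are hypothetical helpers.

```python
import re

NAMES = ["x", "y", "z", "w", "v", "u"]   # assumed to map to indices 0..5
RESERVED = {"sqrt", "sin", "cos", "tan", "log", "exp", "if", "pi"}

def strip_prefix(formula):
    """Drop everything up to and including the first '='."""
    return formula.split("=", 1)[-1].strip()

def var_index(name):
    """Map a variable name to its index: x->0 ... u->5, x6->6, x7->7, ..."""
    if name in NAMES:
        return NAMES.index(name)
    match = re.fullmatch(r"x(\d+)", name)
    if match is None:
        raise ValueError(f"unknown identifier: {name}")
    return int(match.group(1))

def num_vars(formula):
    """Highest variable index in the expression, plus one."""
    body = strip_prefix(formula)
    # \b keeps the 'e5' in a literal such as 1e5 from being read as an identifier.
    idents = [t for t in re.findall(r"\b[A-Za-z_]\w*", body) if t not in RESERVED]
    return max((var_index(t) for t in idents), default=-1) + 1

print(strip_prefix("f(x,y) = (x+y)"))          # (x+y)
print(num_vars("f(x,y) = ((x*x) - (y*y))"))    # 2
print(num_vars("(x6 * 1e5) + sin(z)"))         # 7
```

Under this rule, a formula that only mentions x6 still reports seven inputs, because the count comes from the highest index rather than from how many distinct variables appear.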

Config reference

| Field | Default | Meaning |
| --- | --- | --- |
| pop | 400 | Population size |
| gen | 15000 | Maximum number of generations |
| tournament_size | 4 | Number of candidates in each selection tournament |
| crossover_prob | 0.7 | Crossover probability per pair |
| mutation_prob | 0.25 | Mutation probability per offspring |
| initial_depth | 4 | Tree depth used to seed the initial population |
| mutation_depth | 3 | Depth of subtrees generated by mutation |
| const_min / const_max | -9 / 9 | Range for random numeric terminals |
| pi_prob | 0.01 | Probability that a terminal is pi |
| bloat_penalty | 0.1 | Per-node penalty (favours smaller trees) |
| trig_penalty | 0.5 | Extra penalty per sin/cos/tan/log/exp node |
| immigrant_rate | 0.05 | Fraction of the population replaced by random individuals each generation |
| weak_parent_rate | 0.2 | Probability that the second parent is picked at random instead of by tournament |
| accepted_error | 0.5 | Stop training once total error < this value |
| verbose | False* | Print the best-so-far formula on each improvement |
| simplify | True | Run algebraic simplification on the final tree |
| simplify_interval | 500 | Interval at which the top members are simplified during training |
| simplify_top_n | 10 | How many top members to simplify at each interval |

*C++ default is True; the Python fit() helper defaults to False.
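To show how tournament_size and weak_parent_rate interact, here is a sketch of textbook tournament selection with an occasional random second parent. It follows the field descriptions above but is not taken from the engine: `tournament` and `pick_parents` are hypothetical helpers, and the error values are invented.

```python
import random

def tournament(errors, k, rng):
    """Return the index of the lowest-error individual among k random contenders."""
    contenders = rng.sample(range(len(errors)), k)
    return min(contenders, key=lambda i: errors[i])

def pick_parents(errors, rng, tournament_size=4, weak_parent_rate=0.2):
    """Pick two parent indices; the second is sometimes drawn uniformly at random."""
    first = tournament(errors, tournament_size, rng)
    if rng.random() < weak_parent_rate:
        second = rng.randrange(len(errors))   # "weak" parent: no selection pressure
    else:
        second = tournament(errors, tournament_size, rng)
    return first, second

rng = random.Random(42)
errors = [9.0, 3.5, 0.2, 7.1, 4.4, 1.8]       # made-up total errors, one per individual
print(pick_parents(errors, rng))
```

A larger tournament_size raises selection pressure, since the best individual is more likely to be among the contenders. A higher weak_parent_rate pulls the other way by letting weaker candidates contribute genetic material.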

Building from source

git clone https://github.com/sha0coder/eqhunt
cd eqhunt
pip install -e .
pytest

Requires Python 3.8+, a C++17 compiler, and CMake 3.15+.

License

MIT.
