Simple and efficient symbolic regression

These details have not been verified by PyPI

Project links

Homepage

Project description

PySR.jl

Symbolic regression built on Julia, and interfaced by Python. Uses regularized evolution, simulated annealing, and gradient-free optimization.

Symbolic regression is a very interpretable machine learning algorithm for low-dimensional problems: these tools search equation space to find algebraic relations that approximate a dataset.

One can also extend these approaches to higher-dimensional spaces by using a neural network as proxy, as explained in https://arxiv.org/abs/2006.11287, where we apply it to N-body problems. Here, one essentially uses symbolic regression to convert a neural net to an analytic equation. Thus, these tools simultaneously present an explicit and powerful way to interpret deep models.

Backstory:

Previously, we have used eureqa, which is a very efficient and user-friendly tool. However, eureqa is GUI-only, doesn't allow for user-defined operators, has no distributed capabilities, and has become proprietary (and recently been merged into an online service). Thus, the goal of this package is to have an open-source symbolic regression tool as efficient as eureqa, while also exposing a configurable python interface.

Installation

PySR uses both Julia and Python, so you need to have both installed.

Install Julia - see downloads, and then instructions for mac and linux. Then, at the command line, install the Optim and SpecialFunctions packages via:

julia -e 'import Pkg; Pkg.add("Optim"); Pkg.add("SpecialFunctions")'

For python, you need to have Python 3, numpy, sympy, and pandas installed.

You can install this package from PyPI with:

pip install pysr

Quickstart

import numpy as np
from pysr import pysr

# Dataset
X = 2*np.random.randn(100, 5)
y = 2*np.cos(X[:, 3]) + X[:, 0]**2 - 2

# Learn equations
equations = pysr(X, y, niterations=5,
            binary_operators=["plus", "mult"],
            unary_operators=["cos", "exp", "sin"])

...

print(equations)

which gives:

   Complexity       MSE                                                Equation
0           5  1.947431                          plus(-1.7420927, mult(x0, x0))
1           8  0.486858           plus(-1.8710494, plus(cos(x3), mult(x0, x0)))
2          11  0.000000  plus(plus(mult(x0, x0), cos(x3)), plus(-2.0, cos(x3)))

The newest version of PySR also returns three additional columns:

score - a metric akin to Occam's razor; you should use this to help select the "true" equation.
sympy_format - sympy equation.
lambda_format - a lambda function for that equation, that you can pass values through.

Custom operators

One can define custom operators in Julia by passing a string:

equations = pysr.pysr(X, y, niterations=100,
    binary_operators=["mult", "plus", "special(x, y) = x^2 + y"],
    extra_sympy_mappings={'special': lambda x, y: x**2 + y},
    unary_operators=["cos"])

Now, the symbolic regression code can search using this special function that squares its left argument and adds it to its right. Make sure all passed functions are valid Julia code, and take one (unary) or two (binary) float32 scalars as input, and output a float32. This means if you write any real constants in your operator, like 2.5, you have to write them instead as 2.5f0, which defines it as Float32. Operators are automatically vectorized.

We also define extra_sympy_mappings, so that the SymPy code can understand the output equation from Julia, when constructing a useable function. This step is optional, but is necessary for the lambda_format to work.

One can also edit operators.jl. See below for more options.

Weighted data

Here, we assign weights to each row of data using inverse uncertainty squared. We also use 10 processes instead of the usual 4, which creates more populations (one population per thread).

sigma = ...
weights = 1/sigma**2

equations = pysr.pysr(X, y, weights=weights, procs=10)

Operators

All Base julia operators that take 1 or 2 float32 as input, and output a float32 as output, are available. A selection of these and other valid operators are stated below. You can also define your own in operators.jl, and pass the function name as a string.

Your operator should work with the entire real line (you can use abs(x) - see logm); otherwise the search code will be slowed down with domain errors.

Binary

plus, mult, pow, div, greater, mod, beta, logical_or, logical_and

Unary

neg, exp, abs, logm (=log(abs(x) + 1e-8)), logm10 (=log10(abs(x) + 1e-8)), logm2 (=log2(abs(x) + 1e-8)), sqrtm (=sqrt(abs(x))) log1p, sin, cos, tan, sinh, cosh, tanh, asin, acos, atan, asinh, acosh, atanh, erf, erfc, gamma, relu, round, floor, ceil, round, sign.

Full API

What follows is the API reference for running the numpy interface. You likely don't need to tune the hyperparameters yourself, but if you would like, you can use hyperparamopt.py as an example. However, you should adjust procs, niterations, binary_operators, unary_operators, and maxsize to your requirements.

The program will output a pandas DataFrame containing the equations, mean square error, and complexity. It will also dump to a csv at the end of every iteration, which is hall_of_fame.csv by default. It also prints the equations to stdout.

pysr(X=None, y=None, weights=None, procs=4, populations=None, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, extra_sympy_mappings={}, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20, fast_cycle=False, maxdepth=None, variable_names=[], select_k_features=None, threads=None, julia_optimization=3)

Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i. Note: most default parameters have been tuned over several example equations, but you should adjust threads, niterations, binary_operators, unary_operators to your requirements.

Arguments:

X: np.ndarray or pandas.DataFrame, 2D array. Rows are examples, columns are features. If pandas DataFrame, the columns are used for variable names (so make sure they don't contain spaces).
y: np.ndarray, 1D array. Rows are examples.
weights: np.ndarray, 1D array. Each row is how to weight the mean-square-error loss on weights.
procs: int, Number of processes (=number of populations running).
populations: int, Number of populations running; by default=procs.
niterations: int, Number of iterations of the algorithm to run. The best equations are printed, and migrate between populations, at the end of each.
ncyclesperiteration: int, Number of total mutations to run, per 10 samples of the population, per iteration.
binary_operators: list, List of strings giving the binary operators in Julia's Base, or in operator.jl.
unary_operators: list, Same but for operators taking a single Float32.
alpha: float, Initial temperature.
annealing: bool, Whether to use annealing. You should (and it is default).
fractionReplaced: float, How much of population to replace with migrating equations from other populations.
fractionReplacedHof: float, How much of population to replace with migrating equations from hall of fame.
npop: int, Number of individuals in each population
parsimony: float, Multiplicative factor for how much to punish complexity.
migration: bool, Whether to migrate.
hofMigration: bool, Whether to have the hall of fame migrate.
shouldOptimizeConstants: bool, Whether to numerically optimize constants (Nelder-Mead/Newton) at the end of each iteration.
topn: int, How many top individuals migrate from each population.
nrestarts: int, Number of times to restart the constant optimizer
perturbationFactor: float, Constants are perturbed by a max factor of (perturbationFactor*T + 1). Either multiplied by this or divided by this.
weightAddNode: float, Relative likelihood for mutation to add a node
weightInsertNode: float, Relative likelihood for mutation to insert a node
weightDeleteNode: float, Relative likelihood for mutation to delete a node
weightDoNothing: float, Relative likelihood for mutation to leave the individual
weightMutateConstant: float, Relative likelihood for mutation to change the constant slightly in a random direction.
weightMutateOperator: float, Relative likelihood for mutation to swap an operator.
weightRandomize: float, Relative likelihood for mutation to completely delete and then randomly generate the equation
weightSimplify: float, Relative likelihood for mutation to simplify constant parts by evaluation
timeout: float, Time in seconds to timeout search
equation_file: str, Where to save the files (.csv separated by |)
test: str, What test to run, if X,y not passed.
maxsize: int, Max size of an equation.
maxdepth: int, Max depth of an equation. You can use both maxsize and maxdepth. maxdepth is by default set to = maxsize, which means that it is redundant.
fast_cycle: bool, (experimental) - batch over population subsamples. This is a slightly different algorithm than regularized evolution, but does cycles 15% faster. May be algorithmically less efficient.
variable_names: list, a list of names for the variables, other than "x0", "x1", etc.
select_k_features: (None, int), whether to run feature selection in Python using random forests, before passing to the symbolic regression code. None means no feature selection; an int means select that many features.
julia_optimization: int, Optimization level (0, 1, 2, 3)

Returns:

pd.DataFrame, Results dataframe, giving complexity, MSE, and equations (as strings).

TODO

Async threading, and have a server of equations. So that threads aren't waiting for others to finish.
Print out speed of equation evaluation over time. Measure time it takes per cycle
Add ability to pass an operator as an anonymous function string. E.g., binary_operators=["g(x, y) = x+y"].
Add error bar capability (thanks Johannes Buchner for suggestion)
Why don't the constants continually change? It should optimize them every time the equation appears.
- Restart the optimizer to help with this.
Add several common unary and binary operators; list these.
Try other initial conditions for optimizer
Make scaling of changes to constant a hyperparameter
Make deletion op join deleted subtree to parent
Update hall of fame every iteration?
- Seems to overfit early if we do this.
Consider adding mutation to pass an operator in through a new binary operator (e.g., exp(x3)->plus(exp(x3), ...))
- (Added full insertion operator
Add a node at the top of a tree
Insert a node at the top of a subtree
Record very best individual in each population, and return at end.
Write our own tree copy operation; deepcopy() is the slowest operation by far.
Hyperparameter tune
Create a benchmark for accuracy
Add interface for either defining an operation to learn, or loading in arbitrary dataset.
- Could just write out the dataset in julia, or load it.
Create a Python interface
Explicit constant optimization on hall-of-fame
- Create method to find and return all constants, from left to right
- Create method to find and set all constants, in same order
- Pull up some optimization algorithm and add it. Keep the package small!
Create a benchmark for speed
Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes?
Record hall of fame
Optionally (with hyperparameter) migrate the hall of fame, rather than current bests
Test performance of reduced precision integers
- No effect
Create struct to pass through all hyperparameters, instead of treating as constants
- Make sure doesn't affect performance
Rename package to avoid trademark issues
- PySR?
Put on PyPI
Treat baseline as a solution.
Print score alongside MSE: \delta \log(MSE)/\delta \log(complexity)
Calculating the loss function - there is duplicate calculations happening.
Declaration of the weights array every iteration
Sympy evaluation
Threaded recursion
Test suite
Performance: - Use an enum for functions instead of storing them?
- Gets ~40% speedup on small test.
Use @fastmath
Try @spawn over each sub-population. Do random sort, compute mutation for each, then replace 10% oldest.
Control max depth, rather than max number of nodes?
Allow user to pass names for variables - use these when printing
Check for domain errors in an equation quickly before actually running the entire array over it. (We do this now recursively - every single equation is checked for nans/infs when being computed.)
Sort these todo lists by priority

Feature ideas

Cross-validation
read the docs page
Sympy printing
Better cleanup of zombie processes after
Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation?
Call function to read from csv after running, so dont need to run again
Add function to plot equations
Refresh screen rather than dumping to stdout?
Add ability to save state from python
Additional degree operators?
Multi targets (vector ops). Idea 1: Node struct contains argument for which registers it is applied to. Then, can work with multiple components simultaneously. Though this may be tricky to get right. Idea 2: each op is defined by input/output space. Some operators are flexible, and the spaces should be adjusted automatically. Otherwise, only consider ops that make a tree possible. But will need additional ops here to get it to work. Idea 3: define each equation in 2 parts: one part that is shared between all outputs, and one that is different between all outputs. Maybe this could be an array of nodes corresponding to each output. And those nodes would define their functions.
Tree crossover? I.e., can take as input a part of the same equation, so long as it is the same level or below?
Consider printing output sorted by score, not by complexity.
Dump scores alongside MSE to .csv (and return with Pandas).
Create flexible way of providing "simplification recipes." I.e., plus(plus(T, C), C) => plus(T, +(C, C)). The user could pass these.
Consider allowing multi-threading turned off, for faster testing (cache issue on travis). Or could simply fix the caching issue there.
Consider returning only the equation of interest; rather than all equations.
Enable derivative operators. These would differentiate their right argument wrt their left argument, some input variable.

Algorithmic performance ideas:

Idea: use gradient of equation with respect to each operator (perhaps simply add to each operator) to tell which part is the most "sensitive" to changes. Then, perhaps insert/delete/mutate on that part of the tree?
Start populations staggered; so that there is more frequent printing (and pops that start a bit later get hall of fame already)?
Consider adding mutation for constant<->variable
Implement more parts of the original Eureqa algorithms: https://www.creativemachineslab.com/eureqa.html
Experiment with freezing parts of model; then we only append/delete at end of tree.
Use NN to generate weights over all probability distribution conditional on error and existing equation, and train on some randomly-generated equations
For hierarchical idea: after running some number of iterations, do a search for "most common pattern". Then, turn that subtree into its own operator.
Calculate feature importances based on features we've already seen, then weight those features up in all random generations.
Calculate feature importances of future mutations, by looking at correlation between residual of model, and the features.
- Store feature importances of future, and periodically update it.
Punish depth rather than size, as depth really hurts during optimization.

Code performance ideas:

Try defining a binary tree as an array, rather than a linked list. See https://stackoverflow.com/a/6384714/2689923
Add true multi-node processing, with MPI, or just file sharing. Multiple populations per core.
- Ongoing in cluster branch
Performance: try inling things?
Try storing things like number nodes in a tree; then can iterate instead of counting

mutable struct Tree
    degree::Array{Integer, 1}
    val::Array{Float32, 1}
    constant::Array{Bool, 1}
    op::Array{Integer, 1}
    Tree(s::Integer) = new(zeros(Integer, s), zeros(Float32, s), zeros(Bool, s), zeros(Integer, s))
end

Then, we could even work with trees on the GPU, since they are all pre-allocated arrays.
A population could be a Tree, but with degree 2 on all the degrees. So a slice of population arrays forms a tree.
How many operations can we do via matrix ops? Mutate node=>easy.
Can probably batch and do many operations at once across a population.
- Or, across all populations! Mutate operator: index 2D array and set it to random vector? But the indexing might hurt.
The big advantage: can evaluate all new mutated trees at once; as massive matrix operation.
- Can control depth, rather than maxsize. Then just pretend all trees are full and same depth. Then we really don't need to care about depth.
Can we cache calculations, or does the compiler do that? E.g., I should only have to run exp(x0) once; after that it should be read from memory.
- Done on caching branch. Currently am finding that this is quiet slow (presumably because memory allocation is the main issue).
Add GPU capability?
- Not sure if possible, as binary trees are the real bottleneck.
- Could generate on CPU, evaluate score on GPU?

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

0.19.4

Aug 23, 2024

0.19.3

Jul 29, 2024

0.19.2

Jul 15, 2024

0.19.1

Jul 15, 2024

0.19.0

Jun 22, 2024

0.18.5

Jun 16, 2024

0.18.4

May 4, 2024

0.18.3

Apr 26, 2024

0.18.2

Apr 15, 2024

0.18.1

Mar 26, 2024

0.18.0

Mar 24, 2024

0.17.4

Mar 21, 2024

0.17.3

Mar 20, 2024

0.17.2

Mar 12, 2024

0.17.1

Feb 13, 2024

0.17.0

Feb 12, 2024

0.16.9

Jan 5, 2024

0.16.8

Dec 31, 2023

0.16.7

Dec 31, 2023

0.16.6

Dec 24, 2023

0.16.5

Dec 14, 2023

0.16.4

Dec 13, 2023

0.16.3

Aug 21, 2023

0.16.2

Aug 17, 2023

0.16.1

Aug 10, 2023

0.16.0

Aug 7, 2023

0.15.4

Aug 4, 2023

0.15.3

Aug 2, 2023

0.15.2

Aug 1, 2023

0.15.1

Jul 30, 2023

0.15.0

Jul 28, 2023

0.14.3

Jul 4, 2023

0.14.2

Jun 20, 2023

0.14.1

May 28, 2023

0.14.0

May 20, 2023

0.13.0

May 12, 2023

0.12.3

Apr 27, 2023

0.12.2

Apr 22, 2023

0.12.1

Mar 25, 2023

0.12.0

Mar 21, 2023

0.11.17

Mar 7, 2023

0.11.16

Mar 1, 2023

0.11.15

Feb 18, 2023

0.11.14

Feb 13, 2023

0.11.13

Feb 9, 2023

0.11.12

Jan 16, 2023

0.11.11

Nov 22, 2022

0.11.10

Nov 21, 2022

0.11.9

Nov 5, 2022

0.11.8

Nov 4, 2022

0.11.7

Nov 3, 2022

0.11.6

Oct 31, 2022

0.11.5

Oct 24, 2022

0.11.4

Oct 10, 2022

0.11.3

Oct 6, 2022

0.11.2

Sep 28, 2022

0.11.1.post1

Sep 27, 2022

0.11.1

Sep 26, 2022

0.11.0

Sep 11, 2022

0.10.4.post1

Sep 8, 2022

0.10.4

Sep 8, 2022

0.10.3

Sep 6, 2022

0.10.2

Sep 6, 2022

0.10.1

Aug 14, 2022

0.10.0

Aug 14, 2022

0.9.5

Jul 9, 2022

0.9.4

Jul 8, 2022

0.9.3

Jun 22, 2022

0.9.2

Jun 20, 2022

0.9.1

Jun 6, 2022

0.9.0.post2

Jun 6, 2022

0.9.0.post1

Jun 4, 2022

0.9.0

Jun 4, 2022

0.8.7

May 25, 2022

0.8.6

May 22, 2022

0.8.5

May 20, 2022

0.8.4

May 12, 2022

0.8.3

May 9, 2022

0.8.2

May 8, 2022

0.8.1

May 8, 2022

0.8.0

May 8, 2022

0.7.13

May 7, 2022

0.7.12

May 4, 2022

0.7.11

Apr 26, 2022

0.7.10

Apr 10, 2022

0.7.9

Feb 23, 2022

0.7.8

Feb 23, 2022

0.7.7

Feb 15, 2022

0.7.6

Feb 14, 2022

0.7.5

Feb 13, 2022

0.7.4

Feb 13, 2022

0.7.3

Feb 12, 2022

0.7.2

Feb 5, 2022

0.7.1

Feb 5, 2022

0.7.0.post6

Feb 4, 2022

0.7.0.post5

Feb 3, 2022

0.7.0.post4

Feb 3, 2022

0.7.0.post3

Feb 2, 2022

0.7.0.post2

Feb 1, 2022

0.7.0.post1

Feb 1, 2022

0.7.0

Jan 31, 2022

0.7.0a3 pre-release

Jan 25, 2022

0.7.0a2 pre-release

Jan 22, 2022

0.7.0a1 pre-release

Jan 20, 2022

0.6.14

Nov 19, 2021

0.6.13

Sep 14, 2021

0.6.12.post1

Aug 4, 2021

0.6.12

Aug 4, 2021

0.6.11

Jul 12, 2021

0.6.10

Jun 19, 2021

0.6.9

Jun 16, 2021

0.6.8.post3

Jun 15, 2021

0.6.8.post2

Jun 14, 2021

0.6.8.post1

Jun 14, 2021

0.6.8

Jun 13, 2021

0.6.7

Jun 12, 2021

0.6.6

Jun 9, 2021

0.6.5

Jun 8, 2021

0.6.4

Jun 6, 2021

0.6.3

Jun 6, 2021

0.6.2

Jun 4, 2021

0.6.1

Jun 3, 2021

0.6.0.post1

Jun 1, 2021

0.6.0

Jun 1, 2021

0.6.0rc2 pre-release

May 31, 2021

0.6.0rc1 pre-release

May 30, 2021

0.5.16.post2

Mar 17, 2021

0.5.16.post1

Mar 16, 2021

0.5.16

Mar 7, 2021

0.5.13.post2

Mar 4, 2021

0.5.13.post1

Mar 1, 2021

0.5.13

Feb 27, 2021

0.5.10

Feb 26, 2021

0.5.9

Feb 25, 2021

0.5.2

Feb 24, 2021

0.5.0.post1

Feb 23, 2021

0.5.0

Feb 12, 2021

0.4.9

Feb 9, 2021

0.4.7

Feb 6, 2021

0.4.6

Feb 4, 2021

0.4.4

Feb 4, 2021

0.4.2.post1

Feb 3, 2021

0.4.2

Feb 3, 2021

0.4.0

Feb 1, 2021

0.3.37

Jan 19, 2021

0.3.36

Jan 3, 2021

0.3.35

Dec 30, 2020

0.3.34

Dec 23, 2020

0.3.33

Dec 22, 2020

0.3.32

Dec 22, 2020

0.3.31

Nov 18, 2020

0.3.30

Nov 9, 2020

0.3.29

Oct 31, 2020

0.3.28

Oct 19, 2020

0.3.27

Oct 16, 2020

0.3.26

Oct 16, 2020

0.3.25

Oct 13, 2020

0.3.24

Oct 11, 2020

0.3.23

Oct 11, 2020

0.3.22

Oct 11, 2020

0.3.21

Oct 10, 2020

This version

0.3.20

Oct 10, 2020

0.3.19

Oct 8, 2020

0.3.18

Oct 8, 2020

0.3.17

Oct 2, 2020

0.3.16

Oct 1, 2020

0.3.15

Sep 29, 2020

0.3.14

Sep 29, 2020

0.3.13

Sep 29, 2020

0.3.12

Sep 29, 2020

0.3.11

Sep 29, 2020

0.3.10

Sep 28, 2020

0.3.9

Sep 28, 2020

0.3.8

Sep 27, 2020

0.3.7

Sep 27, 2020

0.3.6

Sep 27, 2020

0.3.4

Sep 26, 2020

0.3.3

Sep 26, 2020

0.3.2

Sep 26, 2020

0.3.0

Sep 25, 2020

0.2.3

Sep 24, 2020

0.2.2

Sep 22, 2020

0.2.1

Sep 21, 2020

0.2.0

Sep 21, 2020

0.1.0

Sep 19, 2020

0.0.1

Sep 19, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pysr-0.3.20.tar.gz (23.7 kB view hashes)

Uploaded Oct 10, 2020 Source

Built Distribution

pysr-0.3.20-py3-none-any.whl (26.7 kB view hashes)

Uploaded Oct 10, 2020 Python 3

Hashes for pysr-0.3.20.tar.gz

Hashes for pysr-0.3.20.tar.gz
Algorithm	Hash digest
SHA256	`4d267c38c34354af20711eb3142c02b1370a37d15bbf8d76c0899992679c5577`
MD5	`6ee2fd715a6ad680f31543799864e4c9`
BLAKE2b-256	`826c2fc90d5f697a397f16be3cd3d0b3ffe748074cc3539808c73605e9a383c8`

Hashes for pysr-0.3.20-py3-none-any.whl

Hashes for pysr-0.3.20-py3-none-any.whl
Algorithm	Hash digest
SHA256	`369c8fab4a7c60122497f5d63e3e942e2beab37eb6671030c714c4323b6c0f21`
MD5	`ab7e433a3046cdce763ffd012d049503`
BLAKE2b-256	`a5df63d6ed3b686815367256dcd055755f387f7873693e1e9620de167c5bb50c`