Skip to main content

A Symbolic Regression engine

Project description

SymReg is a Symbolic Regression library aimed to be easy to use and fast.

You can use it to find expressions trying to explain a given output from given inputs. The expressions can use arbitrary building blocks, not just weighted sums as in linear models.

It works with a modified NSGA-II algorithm, and applies NumPy functions for vectorized evaluation of input.

SymReg is available on PyPI: you can install it using pip install symreg.

Usage demo

Simplest use

from symreg import Regressor
import random

random.seed(0)

r = Regressor(stagnation_limit=20, verbose=True)
X = [[random.random()-.5, random.random()-.5] for _ in range(100)]
y = [(x[0] + x[1])/2 for x in X]     # We want the average of the arguments

r.fit(X, y)

for score in r.results():
    print(score)

# {'error': 0.19027682154977618, 'complexity': 1, 'program': Program('$0', 2)}
# {'error': 0.13948173984343024, 'complexity': 3, 'program': Program('div $0 1.8705715685399509', 2)}
# {'error': 0.0, 'complexity': 5, 'program': Program('div add $0 $1 2.0', 2)}

r.predict([4, 6])
# array([5.])
# The mean of 4 and 6 is 5. Correct!

r.predict([[4, 6], [1, 2]])
# array([5. , 1.5])
# It also handles vectorized data.
# Here, $0=4 and $1=6 for the first row, and $0=1 and $1=2 for the second row in the 2d array.

Adding a custom building block

import symreg
import random
import numpy as np
symreg.ga.blocks['xor'] = (lambda x, y: np.logical_xor(x, y).astype(int), 2)

random.seed(0)
r = symreg.Regressor(generations=200)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

r.fit(X, y)
# There should be an xor operation in the results:
print(r.results())

You can see more examples of usage in test_regressor.py or in the Jupyter Notebook file.

Inspiration

The following projects inspired me:

  • Eureqa Formulize - a proprietary piece of otherwise-great software, which is not available anymore
  • gplearn - which is Free Software and offers strict scikit-learn compatibility (support pipeline and grid search), but does not support multiobjective optimization

Contrary to gplearn, I decided to avoid depending on scikit-learn for implementation simplicity, but still keep the general API of "fit" and "predict", which is intuitive.

Additionally, symreg does not perform function closures (patching of infinities). All candidates are passed as-is to NumPy, relying on the fitness function to eliminate numerically unstable ones (an error of infinity or NaN is infinitely penalized).

Technical details

By using a Pareto frontier instead of a single criterion, we can use elitism (keeping the best alive forever), but also maintain genetic diversity.

In the 2D case, I modified NSGA-II to use a faster O(N*log(N)) Pareto algorithm, because the general N-dimensional algorithm from the paper has poor time complexity.

I include the density penalty from NSGA-II, which is fast, and helps further diversify by spreading individuals throughout the frontier.

I do not use NumPy where it is not appropriate. When dealing with lots of small structures, like the scores of a generation which are length 2 each, the line profiler showed a better execution time with plain lists or tuples.

As with many other regression algorithms, I recommend that the input is scaled before training. This is especially true while SymReg does not have a good regression method for constants.

Still, I personally prefer to avoid changing the sign of data - it makes interpreting the resulting mathematical functions in one's mind more difficult.

Performance

While symreg is faster than gplearn, I suspect there can be a significant improvement by using multithreading. The problem is, due to the Global Interpreter Lock, and to the fact that genetic programming code is CPU bound, we can't easily parallelize the code.

Do you have or know any easy ways to get multithreading? Please share them as a GitHub issue or via e-mail so I can further optimize.

Benchmarking was done on the sklearn diabetes dataset. To prevent thermal variance, I used a single core of an Intel(R) Core(TM) i7-4710HQ CPU on power saver, with TurboBoost off. Only addition, subtraction, multiplication, and division were allowed. See the benchmark_vs_gplearn branch for the specific code.

Parameters

I tuned metaparameters according to held-out test error on the Boston dataset (see the Metaopt notebook and the bottom of metaopt.ods).

Other analyzed parameters don't give out such a clear signal. The number of individuals in a generation, n, is around 65 in both the top 30 individuals, and the worst quartile. The rest seem not to have a clear influence within the tested range. Perhaps this would change if I tested in a different range.

As always, we must take precautions against overfitting. Always use a validation set and a test set, especially with such flexible and complex models like Genetic Programming.

While constraining oneself to use simpler expressions can have some regularization effect, never look at the test set until the end (and you are sure you won't modify anything else), and only then can you discover the true performance of your algorithm.

Engineering & Architecture

I used Test-Driven Development during implementation, and enjoyed it thoroughly.

The "unit" of a unit test is not necessarily a method, but a stable interface, through which the tests must do their job (including testing for detailed behavior). This is in order to keep the implementation flexible; however, testing from too far away might make it harder to debug a failing test.

That does not mean that we allow the tests to be slow. One should keep the test suite running in about a second, because it allows us to refactor easily, while being confident the code stays correct.

Some nondeterministic high-level tests may fail from time to time (test_symreg). If they pass at least once since changing your code, the code can produce what they require, and you should think of them as passing.

In addition to TDD, I strive to respect Gary Berhnardt's Imperative Shell/Functional Core, and Uncle Bob's Clean Architecture. These two are integrated and succintly explained in the author's pretentiously-named blog post here.

These principles enable quick experimentation. I tried 3 domination-frontier algorithms, and decided to keep 2 of them. I barely had to modify the tests, because I tested through the stable, higher-level methods also used by the other classes. I did use the debugger a bit, though.

Running all tests can be done with pytest. You can make the system pretend it's installed with python setup.py develop.

Next steps

The author wishes to eventually implement the following further features (but pull requests are welcome as well, of course):

  • Multiprocessing (threading is not enough, because we're CPU bound and there is the GIL).
    • Problem: new process takes about 200ms. Program evaluation must take longer than that to be worth it. It might be, for very large problems. I tried multiprocessing and multithreading on the parallel branch (in hopes NumPy parallelizes computation inside), but it is slower than a single thread. If you have any tips on getting NumPy to run multithreaded, please share them.
  • Split validation data from training data
    • stopping criterion on validation error stagnation (failure to improve in X generations)
  • Better printing of programs (with parentheses, or infix notation, graphviz like GPLearn, or spreadsheet formulas...)
  • Gradient descent for constants
    • Can be implemented as a mutation; but must tune its performance (i.e., how often should it happen, how much time to spend etc.)
    • Consider Adam optimizer
  • Evaluation caching
    • Remember last N program scores in a dictionary regardless of score
  • Allow richer predictions, returning percentiles (which would inform of model uncertainty)
  • Allow fit_partial straight from symreg, to continue training from an interactive session
  • Pretty plots while training
    • Perhaps a UI like Formulize?

Feedback is appreciated. Please comment as a GitHub issue, or any other way (you can contact the author directly here). GitHub issues are also the preferred method for support requests, because it allows others to learn from your questions.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

symreg-0.0.5.tar.gz (17.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

symreg-0.0.5-py3-none-any.whl (13.8 kB view details)

Uploaded Python 3

File details

Details for the file symreg-0.0.5.tar.gz.

File metadata

  • Download URL: symreg-0.0.5.tar.gz
  • Upload date:
  • Size: 17.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.4.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.6

File hashes

Hashes for symreg-0.0.5.tar.gz
Algorithm Hash digest
SHA256 95e8beeeab9ebb3bbf9589e6e73c881d3333505135da80b7256648bc28f5c8be
MD5 bae79518ac3b3659014eac8e36558572
BLAKE2b-256 a31d2f82561f369973efa439452d1b096fb24100ade1ea3f0489ff83e125ccab

See more details on using hashes here.

File details

Details for the file symreg-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: symreg-0.0.5-py3-none-any.whl
  • Upload date:
  • Size: 13.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.4.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.6

File hashes

Hashes for symreg-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 6b2471f70ce1fdfa21072c7d34e485b6c998d73f6059da4650b53ff54ed90639
MD5 33c5b8c58aa0266f24bf07e9f1b774d1
BLAKE2b-256 fd32be2d08d19eed0bfd91ca5cb056de096e88688ffe4e9818520b727de44c5a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page