
As machine learning practitioners, we usually have a dataset (X, y), and we want to find a function M, also known as a model, such that M(X) ≈ y. Typically, we do not care about the functional form of M. As far as we are concerned, our model can be a neural network, a tree-based algorithm, or something completely different – as long as the performance on the test set is good, we are happy.
However, if we use complex models like these, we might miss out on interesting patterns, maybe even fundamental physics or economic laws hidden within our data. To do better, I will show you how to build models using symbolic regression. These models consist of only a few terms and can be (re-)implemented easily wherever you want. Let us see what I mean by that.
A Physics Experiment
Let us assume that we are experimental physicists and want to find out how long it takes for an object to reach the ground when we drop it from some height h. For example, if you drop an object (that is heavy enough not to be influenced by air resistance) from a height of h = 1.5 m, it will take about t = 0.55 s until it reaches the ground. Try it out!
However, this is only true for Earth, or other celestial bodies with a gravitational acceleration of g = 9.8067 m/s². The Moon, for example, has a gravitational acceleration of 1.625 m/s², and dropping the same object there from 1.5 m would take a longer time of about 1.36 s, which should align with what you know from movies.
Now, our task is to find a general formula t(h, g) that tells us how long the object takes to reach the ground. This is nothing more than building a model that takes the height h and the gravitational acceleration g and predicts the time t. Dear physicists, please bear with me. 😃
Collect data and train a model
Let us assume that we were flying around with our spaceship and dropped a few things from different heights on different planets, always measuring how long it takes to reach the ground. Here are the first rows of the measurements:
[Table: the first rows of the measurements, with columns h, g, and t]
We can now split it into train and test, and then train a standard machine learning model.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Load the measurements: height h, gravitational acceleration g, and fall time t
data = pd.read_csv("https://raw.githubusercontent.com/Garve/towards_data_science/main/Symbolic%20Regression/free_fall.csv")

X = data[["h", "g"]]
y = data["t"]

# Hold out a test set to check how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

rf = RandomForestRegressor(random_state=123).fit(X_train, y_train)
print(rf.score(X_test, y_test))
# Output:
# 0.9874788707086176
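Just to get a feeling for the fit, we can also compare a few predictions with the measured fall times:
# Compare a few model predictions with the actual measurements
print(rf.predict(X_test.head()))
print(y_test.head().to_numpy())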
Cool, so we have built a model that learned the underlying pattern very well. But… what is that pattern?
Model interpretation
It is hard to tell, since the actual formula that lives within the model is quite complicated. We can try to see what it has learned using Shapley values. Knowing exactly what they are is not necessary for this article, but if you are still interested, I recommend checking out my other article about them.
Basically, they let you determine how the different features drive the model output. Using the awesome library shap, you can do the following:
# !pip install shap
import shap

# Explain the trained random forest with Shapley values
te = shap.TreeExplainer(rf, feature_names=["h", "g"])
shapley_values = te(X)

# One scatter plot of Shapley values per feature
shap.plots.scatter(shap_values=shapley_values)
[Shapley value scatter plots: h on the left, g on the right]
On the left side, you can see that higher values of h result in higher model outputs, i.e., higher values of t. On the right side, you can see that lower values of g produce higher values of t according to the model. Both observations make sense: if we drop an object from a higher place, it will take longer until it reaches the ground. And if the object gets pulled towards the ground with a stronger force, it will reach the ground faster.
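By the way, if you want to look at a single panel, recent shap versions also let you slice the Explanation object by feature name:
# Plot only the Shapley values for the height h
shap.plots.scatter(shapley_values[:, "h"])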
Usually, we are happy with a result like this and can deploy the model. Still, we can argue that this approach has a few issues:
- the model is quite complicated and gives us zero theoretical insight,
- and because of this, we have to deploy it as a scikit-learn model, i.e., we cannot just reimplement it easily when our deployment service does not like scikit-learn models.
Let us see how we can build another model that fixes both issues.
Symbolic Regression
We can utilize a technique called symbolic regression that tries to find simple algebraic expressions that describe the data. For example, the output of a model trained on a dataset with the features x₁, x₂, and x₃ and the target y could be y = √x₁ + sin(x₂/(x₃+1)) or y = exp(x₁²) − x₂. Let us see how it fares on our "free fall" dataset.
There are many packages for symbolic regression, for example, gplearn and PySR. I will use gplearn here since it is slightly easier to use, but it has not been updated for 2 years now, unfortunately. PySR is in active development but uses Julia under the hood, so you have another dependency.
Fitting symbolic regression
First, let us install gplearn via pip install gplearn. Then, we can define a SymbolicRegressor and specify a set of operations – such as addition, or taking a logarithm – that it can use within the formula.
from gplearn.genetic import SymbolicRegressor

# The building blocks the algorithm may combine into a formula
sr = SymbolicRegressor(
    function_set=(
        "add",
        "sub",
        "mul",
        "div",
        "sqrt",
        "log",
        "abs",
        "neg",
    ),
    random_state=0,
)

sr.fit(X_train, y_train)
The model now tries to find a formula that
- is simple, but
- also describes the data well, i.e., with a small loss.
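After fitting, gplearn stores the best program in the _program attribute of the fitted regressor, and since SymbolicRegressor follows the scikit-learn API, we can also compute the test r²:
# The best formula found, and its r² on the test set
print(sr._program)
print(sr.score(X_test, y_test))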
In my case, this is the result:
sqrt(div(add(X0, X0), X1))
which is the ugly version of
t = √(2h/g)
since x₀ = h and x₁ = g. Apart from having a test r² of over 0.999, this is actually the correct physics formula that describes falling objects without air resistance! You might also find the equivalent formula h = 0.5 · g · t² in the literature.
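As a quick sanity check, this formula reproduces the numbers from the introduction:
import numpy as np

# t = sqrt(2h/g) for h = 1.5 m on Earth and on the Moon
print(np.sqrt(2 * 1.5 / 9.8067))  # ≈ 0.55 s
print(np.sqrt(2 * 1.5 / 1.625))   # ≈ 1.36 s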
We have rediscovered what smart physicists found out hundreds of years ago, but without any physics knowledge!
And don’t just take my word for it. The authors of PySR published a paper in which they show how many more physical laws can be rediscovered. On page 16, you can see this table:
[Table from the PySR paper comparing symbolic regression libraries]
In the columns, you can see a bunch of symbolic regression libraries; on the left side, there are physical laws. The fraction in each cell is the number of correct rediscoveries of the expression divided by the total number of trials. Here are the formulas:
[The physical laws used in the comparison]
Planck’s law and the Rydberg formula were too hard to discover for any library. Still, only PySR did a good job in the other cases. gplearn was not part of the comparison.
How does this sorcery work?
So, the functional form of our model can be completely crazy. This is fundamentally different from neural networks, where the formula itself is fixed and only the parameters are adjusted, and from tree-based algorithms such as gradient boosting, where the result is always a piecewise constant function.
Long story short, these libraries use evolutionary algorithms under the hood, which would take too long to cover in this article. These are heavily randomized, heuristic processes that are extremely hard to analyze. Still, they can work quite well in practice, as we have seen above. If you want a beginner-friendly introduction to evolutionary algorithms, check out my other article.
If you don’t feel like reading it now, here is the gist of it for our current use case:
- Start with a bunch of simple random formulas, e.g. t = h, t = g², t = g + h, t = √h, …
- Check how good they are using a loss function, e.g. mean squared error.
- Pick some of the best formulas and combine them, e.g. combining t = h and t = g² could result in t = h + g². Or combining t = √h and t = h · g could result in t = √(h · g).
- You can also mutate them, e.g. slightly changing t = √(h·g) to t = √(h + g).
- Now, you have a new set of formulas and can go back to step 2. Or, if you have already spotted a very good formula, stop.
How these steps are implemented depends on the library you use. You have a lot of freedom to design them, which makes writing these algorithms really fun. The PySR authors describe their specific version of evolutionary algorithms in their paper as well.
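To make these steps concrete, here is a minimal toy sketch in Python. It is my own simplified illustration, not how gplearn or PySR actually implement it: formulas are nested tuples, each generation keeps the formulas with the best (parsimony-penalized) loss, and the population is refilled with combined and mutated copies.
import random
import numpy as np

# Toy evolutionary symbolic regression (a simplified illustration only).
# A formula is either a variable name or a tuple (operator, child, child).
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / np.where(np.abs(b) > 1e-9, b, 1e-9),  # protected
    "sqrt": lambda a, b: np.sqrt(np.abs(a)),  # unary, second child unused
}

def random_formula(depth):
    if depth == 0 or random.random() < 0.3:
        return random.choice(["h", "g"])
    op = random.choice(list(OPS))
    return (op, random_formula(depth - 1), random_formula(depth - 1))

def evaluate(formula, data):
    if isinstance(formula, str):
        return data[formula]
    op, left, right = formula
    return OPS[op](evaluate(left, data), evaluate(right, data))

def size(formula):
    return 1 if isinstance(formula, str) else 1 + size(formula[1]) + size(formula[2])

def fitness(formula, data, target):
    # Step 2: mean squared error, plus a small penalty for long formulas
    with np.errstate(all="ignore"):
        mse = np.mean((evaluate(formula, data) - target) ** 2)
    return (mse if np.isfinite(mse) else np.inf) + 0.01 * size(formula)

def combine(a, b):
    # Step 3: replace a random subtree of a with the formula b
    if isinstance(a, str) or random.random() < 0.3:
        return b
    op, left, right = a
    if random.random() < 0.5:
        return (op, combine(left, b), right)
    return (op, left, combine(right, b))

def mutate(formula):
    # Step 4: replace a random subtree with a fresh random one
    return combine(formula, random_formula(depth=2))

# Synthetic data following the true law t = sqrt(2h/g), just for this demo
rng = np.random.default_rng(0)
data = {"h": rng.uniform(1, 100, 500), "g": rng.uniform(1, 25, 500)}
target = np.sqrt(2 * data["h"] / data["g"])

# Step 1: start with random formulas, then loop over the generations
population = [random_formula(depth=3) for _ in range(500)]
for _ in range(30):
    population.sort(key=lambda f: fitness(f, data, target))
    survivors = population[:100]
    children = [combine(random.choice(survivors), random.choice(survivors)) for _ in range(200)]
    mutants = [mutate(random.choice(survivors)) for _ in range(200)]
    population = survivors + children + mutants

population.sort(key=lambda f: fitness(f, data, target))
print(population[0])
The size penalty in the fitness is one way to encode the "simple, but describes the data well" trade-off from above; without it, the formulas tend to bloat over the generations.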
Conclusion
In this article, we have seen how easy it is to build models with meaningful formulas. In our example, not only was the performance on the test set better; since our model is just a simple formula, using it anywhere is also trivial.
For example, you can go to SQL and write this:
SELECT
  h,
  g,
  SQRT(2 * h / g) AS t_pred
FROM
  your_table
and here we go! You can also easily reimplement this formula in any programming language of your choice.
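Or, as a plain Python function:
import math

def fall_time(h: float, g: float) -> float:
    """Predicted fall time in seconds for a drop height h (m) and gravity g (m/s²)."""
    return math.sqrt(2 * h / g)

print(fall_time(1.5, 9.8067))  # ≈ 0.55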
Sure, the formula will not always be short or describe the data well. In that case, you can add more operations for the algorithm to choose from, like the sine or the exponential function, or you can even create your own small building blocks. This could help create a crisp formula again.
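In gplearn, that could look like the following sketch: "sin" is already built in, and make_function lets you define custom building blocks (the protected exponential and its clipping threshold are my own choices):
import numpy as np
from gplearn.functions import make_function
from gplearn.genetic import SymbolicRegressor

# A protected exponential: clip the input so the output stays finite
def _protected_exp(x):
    return np.exp(np.clip(x, -50.0, 50.0))

exp_function = make_function(function=_protected_exp, name="exp", arity=1)

# Built-in operations are passed as strings, custom ones as objects
sr = SymbolicRegressor(
    function_set=("add", "sub", "mul", "div", "sqrt", "sin", exp_function),
    random_state=0,
)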
I hope that you learned something new, interesting, and valuable today. Thanks for reading!
If you have any questions, write me on LinkedIn!
And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!