Reinforcement Learning with PDEs

Previously we discussed applying reinforcement learning to Ordinary Differential Equations (ODEs) by integrating ODEs within gymnasium. ODEs are a powerful tool that can describe a wide range of systems, but they are limited to derivatives with respect to a single variable. Partial Differential Equations (PDEs) are differential equations involving derivatives of multiple variables and can cover a far broader range of systems, and more complex ones. Often, ODEs are special cases of, or special assumptions applied to, PDEs.

PDEs include Maxwell’s Equations (governing electricity and magnetism), the Navier-Stokes equations (governing fluid flow in aircraft, engines, blood, and other settings), and the Boltzmann equation of thermodynamics. PDEs can describe systems such as flexible structures, power grids, manufacturing processes, or epidemiological models in biology. They can represent highly complex behavior; the Navier-Stokes equations describe the eddies of a rushing mountain stream. Their capacity for capturing and revealing the more complex behavior of real-world systems makes these equations an important topic for study, both for describing systems and for analyzing known equations to make new discoveries about those systems. Entire fields (such as fluid dynamics, electrodynamics, and structural mechanics) are devoted to the study of just a single set of PDEs.

This increased complexity comes with a cost; the systems captured by PDEs are much more difficult to analyze and control. ODEs are described as lumped-parameter systems: the various parameters and variables that describe them are “lumped” into a discrete point (or a small number of points for a coupled system of ODEs). PDEs are distributed-parameter systems that track behavior throughout space and time. In other words, the state space for an ODE is a relatively small number of variables, such as time and a few system measurements at a specific point. For PDE/distributed-parameter systems, the state space can approach infinite dimensions, or be discretized for computation into millions of points for each time step. A lumped-parameter controller regulates the temperature of an engine based on a small number of sensors; a PDE/distributed-parameter controller would manage the temperature dynamics across the entire engine.

As with ODEs, many PDEs must be analyzed (aside from special cases) through modelling and simulation. However, due to the higher dimensions, this modelling becomes far more complex. Many ODEs can be solved through straightforward application of algorithms like MATLAB’s ODE45 or SciPy’s solve_ivp. PDEs are modelled across grids or meshes where the PDE is simplified to an algebraic equation (such as through a Taylor series expansion) at each point on the grid. Grid generation is a field, a science and an art, of its own, and the ideal (or even usable) grid can vary greatly based on problem geometry and physics. Grids (and hence problem state spaces) can number in the millions of points, with computation times running to days or weeks, and PDE solvers are often commercial software costing tens of thousands of dollars.

Controlling PDEs presents a far greater challenge than controlling ODEs. The Laplace transform that forms the basis of much of classical control theory is a one-dimensional transformation. While there has been some progress in PDE control theory, the field is not as comprehensive as for ODE/lumped systems. For PDEs, even basic controllability or observability assessments become difficult, as the state space to assess increases by orders of magnitude and fewer PDEs have analytic solutions. By necessity, we run into design questions such as: what part of the domain needs to be controlled or observed? Can the rest of the domain be in an arbitrary state? What subset of the domain does the controller need to operate over? With key tools in control theory underdeveloped, and new problems presented, applying machine learning has become a major area of research for understanding and controlling PDE systems.

Given the importance of PDEs, there has been research into developing control strategies for them. For example, Glowinski et al. developed an analytical adjoint-based method from advanced functional analysis relying on simulation of the system. Other approaches, such as those discussed by Kirsten Morris, apply estimations to reduce the order of the PDE to facilitate more traditional control approaches. Botteghi and Fasel have begun to apply machine learning to the control of these systems (note, this is only a very brief glimpse of the research). Here we will apply reinforcement learning to two PDE control problems. The diffusion equation is a simple, linear, second-order PDE with a known analytic solution. The Kuramoto–Sivashinsky (K-S) equation is a much more complex fourth-order nonlinear equation that models instabilities in a flame front.

For both of these equations we use a simple, small square domain of grid points. We target a sinusoidal pattern along a line down the middle of the domain by controlling input along the left and right sides. Input parameters for the controls are the values at the target region and the {x,y} coordinates of the input control points. Training the algorithm requires modelling the system’s development through time with the control inputs. As discussed above, this requires a grid where the equation is solved at each point and then iterated through each time step. I used the py-pde package to create a training environment for the reinforcement learner (thanks to the developer of this package for his prompt feedback and help!). With the py-pde environment, the approach proceeded as usual for reinforcement learning: the particular algorithm develops a guess at a controller strategy, that strategy is applied at small, discrete time steps, and it provides control inputs based on the current state of the system that lead to some reward (in this case, based on the root mean square difference between the target and current distributions).
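
To make that setup concrete, below is a minimal sketch of what such a py-pde-backed gymnasium environment can look like. It is not the exact environment used here: the class name DiffusionControlEnv, the one-control-value-per-side action, the ±10 action bound, and the 0.1 control interval are illustrative assumptions, while the grid, boundary conditions, solver, and target mirror the diffusion code shown later in this post, and the stepper call follows py-pde’s make_stepper interface.

import numpy as np
import gymnasium as gym
from gymnasium import spaces
import pde
from pde import ScalarField, DiffusionPDE

class DiffusionControlEnv(gym.Env):
    """Sketch: drive a 2D diffusion field toward a sinusoidal target along the middle column."""

    def __init__(self, dt_control=0.1, t_end=15.0):
        self.grid = pde.CartesianGrid([[0, 1], [0, 1]], [20, 20], periodic=[False, True])
        self.target = np.sin(2 * np.pi * self.grid.axes_coords[1])  # sinusoidal target along y
        self.dt_control, self.t_end = dt_control, t_end
        # observe the 20 values along the central (target) column
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(20,), dtype=float)
        # one control value per side (left/right), capped at +/-10 (assumed bound)
        self.action_space = spaces.Box(-10.0, 10.0, shape=(2,), dtype=float)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = ScalarField.random_uniform(self.grid, 0.0, 0.2)
        eq = DiffusionPDE(diffusivity=0.1, bc=[[{"value": 0}, {"value": 0}], "periodic"])
        solver = pde.ExplicitSolver(eq, scheme="euler", adaptive=True)
        self.stepper = solver.make_stepper(self.state, dt=1e-3)  # advances the field in place
        self.t = 0.0
        return self.state.data[10, :].copy(), {}

    def step(self, action):
        # apply the control to the columns just inside the left/right boundaries
        self.state.data[1, :] = action[0]
        self.state.data[-2, :] = action[1]
        self.t = self.stepper(self.state, self.t, self.t + self.dt_control)
        obs = self.state.data[10, :].copy()
        reward = -float(np.sqrt(np.mean((obs - self.target) ** 2)))  # negative RMS error
        terminated = self.t >= self.t_end
        return obs, reward, terminated, False, {}

A GP or SAC agent then interacts with this environment exactly as it would with any other gymnasium task.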

Unlike previous cases, I only present results from the genetic-programming controller. I developed code to apply a soft actor critic (SAC) algorithm to execute as a container on AWS Sagemaker. However, full execution would take about 50 hours and I didn’t want to spend the money! I looked for ways to reduce the computation time, but eventually gave up due to time constraints; this article was already taking long enough to get out with my job, military reserve duty, family visits over the holidays, civic and church involvement, and not leaving my wife to take care of our baby boy alone!

First we will discuss the diffusion equation:

∂u(x, t)/∂t = μ∆u(x, t)

with x a two-dimensional Cartesian vector, u the scalar field, and ∆ the Laplace operator. As mentioned, this is a simple second-order (second-derivative) linear partial differential equation in time and two-dimensional space. Mu (μ) is the diffusion coefficient, which determines how fast effects travel through the system. The diffusion equation tends to wash out (diffuse!) effects on the boundaries throughout the domain and exhibits stable dynamics. The PDE is implemented as shown below with grid, equation, boundary conditions, initial conditions, and target distribution:

import numpy as np
import pde
from pde import ScalarField, DiffusionPDE

# 20x20 grid on the unit square; Dirichlet boundaries in x, periodic in y
grid = pde.CartesianGrid([[0, 1], [0, 1]], [20, 20], periodic=[False, True])
state = ScalarField.random_uniform(grid, 0.0, 0.2)   # random initial condition

bc_left = {"value": 0}
bc_right = {"value": 0}
bc_x = [bc_left, bc_right]
bc_y = "periodic"

eq = DiffusionPDE(diffusivity=0.1, bc=[bc_x, bc_y])
solver = pde.ExplicitSolver(eq, scheme="euler", adaptive=True)
stepper = solver.make_stepper(state, dt=1e-3)        # stepper advances the field in time

# sinusoidal target distribution along the y-axis
target = 1. * np.sin(2 * grid.axes_coords[1] * np.pi)

The problem is sensitive to the diffusion coefficient and the domain size; a mismatch between the two washes out the control inputs before they can reach the target region unless the simulation is run for a long time. The control input was updated and the reward evaluated every 0.1 time units up to an end time of T = 15.

Due to the py-pde package architecture, the control is applied to one column inside the boundary. Structuring the py-pde package to execute with the boundary condition updated each time step resulted in a memory leak, and the py-pde developer advised using a stepper function as a workaround, which does not allow updating the boundary condition. This means the results aren’t exactly physical, but they do display the basic principle of PDE control with reinforcement learning.

The GP algorithm was able to arrive at a final reward (based on the sum of the mean square error over all 20 points in the central column) of about 2.0 after about 30 iterations with a 500-tree forest. The results are shown below as the target and achieved distributions in the target region.

Figure 1: Diffusion equation, green target distribution, red achieved. Provided by author.

Now the more interesting and complex K-S equation:

∂u/∂t = -|∇u|²/2 - ∆u - ∆²u

Unlike the diffusion equation, the K-S equation displays rich dynamics (as befits an equation describing flame behavior!). Solutions may include stable equilibria or travelling waves, but with increasing domain size all solutions eventually become chaotic. The PDE implementation is given by the code below:

import numpy as np
import pde
from pde import PDE, ScalarField

# 20x20 grid on a 10x10 domain, periodic in both directions
grid = pde.CartesianGrid([[0, 10], [0, 10]], [20, 20], periodic=[True, True])
state = ScalarField.random_uniform(grid, 0.0, 0.5)
bc_x = "periodic"
bc_y = "periodic"
# K-S equation expressed directly as its right-hand side
eq = PDE({"u": "-gradient_squared(u) / 2 - laplace(u + laplace(u))"}, bc=[bc_x, bc_y])
solver = pde.ExplicitSolver(eq, scheme="euler", adaptive=True)
stepper = solver.make_stepper(state, dt=1e-3)
target = 1. * np.sin(0.25 * grid.axes_coords[1] * np.pi)   # sinusoidal target along y

Control inputs are capped at ±5. The K-S equation is naturally unstable; if any point in the domain exceeds ±30, the iteration terminates with a large negative reward for causing the system to diverge. Experiments with the K-S equation in py-pde revealed strong sensitivity to domain size and number of grid points. The equation was run to T = 35, with both the control and the reward updated every dt = 0.1.
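
As an illustration, that divergence guard can be a simple check inside the environment’s step function; this is a hedged sketch of my own, and the -1000 penalty is an assumed value rather than the one used in these experiments.

import numpy as np

def divergence_guard(field_data, bound=30.0, penalty=-1000.0):
    """Return (terminated, reward) if the field has exceeded the stability bound."""
    if np.max(np.abs(field_data)) > bound:
        return True, penalty          # end the episode with a large negative reward
    return False, 0.0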

Here the GP algorithm had more trouble arriving at a solution than for the diffusion equation. I chose to manually stop execution when the solution became visually close; again, we are looking for general principles here. For the more complex system, the controller works better, likely because the K-S equation is so dynamic that the controller can have a bigger impact. However, when evaluating the solution for different run times, I found it was not stable; the algorithm learned to arrive at the target distribution at a particular time, not to stabilize at that solution. The algorithm converged to the solution below, but, as the successive time steps show, the solution is unstable and begins to diverge with increasing time.

Figure 2: K-S equation Green target; yellow, red, magenta, cyan, blue for T = 10, 20, 30, 40. Provided by author.

Careful tuning of the reward function would help obtain a solution that holds longer, reinforcing how vital a correct reward function is. Also, in all these cases we aren’t arriving at perfect solutions; but, especially for the K-S equation, we are getting decent solutions with comparatively little effort compared to non-RL approaches for tackling these sorts of problems.

The GP solution takes longer to solve more complex problems and has trouble handling large input variable sets. With larger input sets, the equations it generates become longer, which makes them less interpretable and slower to compute. Solution equations had scores of terms rather than the dozen or so seen in the ODE systems. Neural network approaches can handle large input variable sets more easily, as input variables only directly impact the size of the input layer. Further, I suspect that neural networks will be able to handle more complex and larger problems better, for reasons discussed in previous posts. Because of that, I did develop gymnasiums for py-pde diffusion, which can easily be adapted to other PDEs per the py-pde documentation. These gymnasiums can be used with different NN-based reinforcement learning algorithms, such as the SAC implementation I developed (which, as discussed, runs but takes time).

Adjustments could also be made to the genetic programming approach. For example, a vector representation of inputs could reduce the size of the solution equations. Duriez et al.¹ propose using the Laplace transform to introduce derivatives and integrals into the genetic programming equations, broadening the function spaces they can explore.

The ability to tackle more complex problems is important. As discussed above, PDEs can describe a wide range of complex phenomena. Currently, controlling these systems usually means lumping parameters. Doing so leaves out dynamics, so we end up working against such systems rather than with them. Efforts to control or manage them then mean higher control effort, missed efficiencies, and increased risk of failure (small or catastrophic). Better understanding of, and control alternatives for, PDE systems could unlock major gains in engineering fields where marginal improvements have been the standard, such as traffic, supply chains, and nuclear fusion, as these systems behave as high-dimensional distributed-parameter systems. They are highly complex, with nonlinear and emergent phenomena, but have large available data sets, which makes them ideal for machine learning to move past current barriers in understanding and optimization.

For now, I have only taken a very basic look at applying ML to controlling PDEs. Follow-ons to the control problem include not just different systems, but optimizing where in the domain the control is applied, experimenting with reduced-order observation spaces, and optimizing the control for simplicity or control effort. In addition to improved control efficiency, as discussed in Brunton and Kutz², machine learning can also be used to derive data-based models of complex physical systems and to determine reduced-order models, which reduce state-space size and may be more amenable to analysis and control by traditional or machine learning methods. Machine learning and PDEs is an exciting area of research, and I encourage you to see what the professionals are doing!

  1. Duriez, Thomas, Steven L. Brunton, and Bernd R. Noack. Machine Learning Control – Taming Nonlinear Dynamics and Turbulence.
  2. Brunton, Steven L., and J. Nathan Kutz. Data-Driven Science and Engineering.

Reinforcement Learning for Physics: ODEs and Hyperparameter Tuning

Controlling differential equations with gymnasium and optimizing algorithm hyperparameters

As discussed previously, Reinforcement Learning (RL) provides a powerful new tool for approaching the challenges of controlling nonlinear physical systems. Nonlinear physical systems are characterized by complex behavior, where small changes in input can lead to dramatic changes in output, or only small output changes may result from large inputs. Solutions can split, where the same conditions can produce different outputs, or even have "memory" in the form of path dependence. We introduced two different approaches to applying RL to a nonlinear physical system: the traditional, neural-network based Soft Actor Critic (SAC) and an uncommon genetic-algorithm based Genetic Programming (GP) approach.

Briefly, SAC uses two neural networks, one to learn how the environment behaves and one to determine an optimal policy. As the model trains, the networks update, and the environment-learning "critic" network helps evaluate and improve the policy-determining "actor" network. GP is based on generating a "forest" of random mathematical equations, evaluating how well they perform in the environment, and then mutating, combining, or making new random equations to improve performance. Applied to gymnasium’s pendulum classic control environment, the GP approach showed faster convergence. Now we expand upon that study by (1) introducing more complex physical systems based on ordinary differential equations and (2) exploring the impact of hyperparameter tuning on algorithm performance for both SAC and GP.


Working with ODEs

Physical systems can typically be modeled through differential equations, or equations including derivatives. Forces, and hence Newton’s Laws, can be expressed as derivatives, as can Maxwell’s Equations, so differential equations can describe most physics problems. A differential equation describes how a system changes based on the system’s current state, in effect defining the state transition. Systems of differential equations can be written in matrix/vector form:

ẋ = Ax

where x is the state vector, A is the state transition matrix determined from the physical dynamics, and ẋ (or dx/dt) is the change in the state with a change in time. Essentially, matrix A acts on state x to advance it a small step in time. This formulation is typically used for linear equations (where the elements of A do not contain any state variables) but can be used for nonlinear equations where the elements of A may contain state variables, which can lead to the complex behavior described above. This equation describes how an environment or system develops in time, starting from a particular initial condition. In mathematics, these are referred to as initial value problems, since evaluating how the system will develop requires specification of a starting state.
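
As a quick concrete example (mine, not from the original post), a frictionless mass-spring oscillator with acceleration ẍ = -ω²x fits this form if we take the state vector to be x = [x1, x2] = [position, velocity]:

ẋ1 = x2
ẋ2 = -ω²x1

so that A = [[0, 1], [-ω², 0]], a linear system since the elements of A contain no state variables.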

The expression above describes a particular class of differential equations, ordinary differential equations (ODEs), where the derivatives are all with respect to one variable, usually time but occasionally space. The dot denotes dx/dt, or the change in state with an incremental change in time. ODEs are well studied, and linear systems of ODEs have a wide range of analytic solution approaches available. Analytic solutions allow solutions to be expressed in terms of variables, making them more flexible for exploring whole-system behavior. Nonlinear ODEs have fewer approaches, though certain classes of systems do have analytic solutions available. For the most part, though, nonlinear (and some linear) ODEs are best solved through simulation, where the solution is determined as numeric values at each time step.

Simulation is based on finding an approximation to the differential equation, often through transformation to an algebraic equation, that is accurate to a known degree over a small change in time. Computers can then step through many small changes in time to show how the system develops. There are many algorithms available to perform this calculation, such as MATLAB’s ODE45 or Python SciPy’s solve_ivp functions. These algorithms take an ODE and a starting point/initial condition, automatically determine an optimal step size, and advance through the system to the specified ending time.
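
For instance, a minimal solve_ivp call looks like the sketch below; this is my own toy example of exponential decay, not one of the systems studied in this post.

import numpy as np
from scipy.integrate import solve_ivp

def decay(t, x):
    """Simple linear ODE dx/dt = -x."""
    return -x

# integrate from t=0 to t=5 starting at x(0)=1; solve_ivp chooses the step sizes
sol = solve_ivp(decay, (0.0, 5.0), [1.0])
print(sol.y[0, -1], np.exp(-5.0))   # numerical endpoint vs. the exact solution e^-5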

If we can apply the correct control inputs to an ODE system, we can often drive it to a desired state. As discussed last time, RL provides an approach to determine the correct inputs for nonlinear systems. To develop RL controllers, we will again use the gymnasium environment, but this time we will create a custom gymnasium environment based on our own ODE. Following the Gymnasium documentation, we create an observation space that will cover our state space, and an action space for the control space. We initialize/reset the gymnasium to an arbitrary point within the state space (though here we must be cautious; not all desired end states are reachable from any initial state for some systems). In the gymnasium’s step function, we take a step over a short time horizon in our ODE, applying the algorithm’s estimated input, using Python SciPy’s solve_ivp function. Solve_ivp calls a function which holds the particular ODE we are working with. Code is available on git. The init and reset functions are straightforward; init creates an observation space for every state in the system, and reset sets a random starting point for each of those variables within the domain, at a minimum distance from the origin. In the step function, note the solve_ivp line that calls the actual dynamics, solving the dynamics ODE over a short time step and passing the applied control K.

# adapted from https://www.gymlibrary.dev/content/environment_creation/
# create gym for the Moore-Greitzer Model (MGM)
# action space: continuous +/- 10.0 float (maybe scale to mu)
# observation space: -30, 30 x2 float for x1 (mass flow), x2 (pressure rise)
# reward: -sqrt(x1^2) (try to drive the state to 0)

# Moore-Greitzer model:

from os import path
from typing import Optional

import numpy as np
import math

import scipy
from scipy.integrate import solve_ivp

import gymnasium as gym
from gymnasium import spaces
from gymnasium.envs.classic_control import utils
from gymnasium.error import DependencyNotInstalled
import dynamics  #local library containing formulas for solve_ivp
from dynamics import MGM

class MGMEnv(gym.Env):
    #no render modes
    def __init__(self, render_mode=None, size=30):

        self.observation_space =spaces.Box(low=-size+1, high=size-1, shape=(2,), dtype=float)

        self.action_space = spaces.Box(-10, 10, shape=(1,), dtype=float) 
        #need to update action to normal distribution

    def _get_obs(self):
        return self.state

    def reset(self, seed: Optional[int] = None, options=None):
        #need below to seed self.np_random
        super().reset(seed=seed)

        #start random x1, x2 origin
        np.random.seed(seed)
        x=np.random.uniform(-8.,8.)
        while (x>-2.5 and x<2.5):
            np.random.seed()
            x=np.random.uniform(-8.,8.)
        np.random.seed(seed)
        y=np.random.uniform(-8.,8.)
        while (y>-2.5 and y<2.5):
            np.random.seed()
            y=np.random.uniform(-8.,8.)
        self.state = np.array([x,y])
        observation = self._get_obs()

        return observation, {}

    def step(self, action):

        u = action.item()

        # advance the MGM dynamics over a short time horizon with the applied control u
        result = solve_ivp(MGM, (0, 0.05), self.state, args=[u])

        x1 = result.y[0, -1]
        x2 = result.y[1, -1]
        self.state = np.array([x1.item(), x2.item()])
        done = False
        observation = self._get_obs()

        # reward penalizes distance of x1 from 0 (x2 term currently disabled)
        reward = -math.sqrt(x1.item()**2)  # + x2.item()**2

        truncated = False  # placeholder for future expansion/limits if the solution diverges

        return observation, reward, done, truncated, {}

Below are the dynamics of the Moore-Greitzer Model (MGM) function. This implementation is based on the solve_ivp documentation. Limits are placed to avoid solution divergence; if the system hits the limits, the reward will be low, causing the algorithm to revise its control approach. Creating ODE gymnasiums based on the template discussed here should be straightforward: change the observation space size to match the dimensions of the ODE system and update the dynamics equation as needed.

import numpy as np

def MGM(t, A, K):
    # nonlinear approximation of surge/stall dynamics of a gas turbine engine per the Moore-Greitzer model from
    # "Output-Feedback Control of Nonlinear Systems Using Control Contraction Metrics and Convex Optimization"
    # by Manchester and Slotine
    # 2D system: x1 is mass flow, x2 is pressure increase
    x1, x2 = A
    # clamp the state to avoid numerical divergence
    if x1 > 20:  x1 = 20.
    elif x1 < -20:  x1 = -20.
    if x2 > 20:  x2 = 20.
    elif x2 < -20:  x2 = -20.
    dx1 = -x2 - 1.5*x1**2 - 0.5*x1**3
    dx2 = x1 + K
    return np.array([dx1, dx2])

For this example, we are using an ODE based on the Moore-Greitzer Model (MGM), which describes gas turbine engine surge-stall dynamics¹. This equation describes coupled, damped oscillations between engine mass flow and pressure. The goal of the controller is to quickly dampen the oscillations to 0 by controlling pressure on the engine. The MGM has "motivated substantial development of nonlinear control design," making it an interesting test case for the SAC and GP approaches. Code describing the equation can be found on Github. Also listed are three other nonlinear ODEs. The Van der Pol oscillator is a classic nonlinear oscillating system based on the dynamics of electronic systems. The Lorenz attractor is a seemingly simple system of ODEs that can produce chaotic behavior, or results so highly sensitive to initial conditions that any infinitesimally small difference in starting point will, in an uncontrolled system, soon lead to widely divergent states. The third is a mean-field ODE system provided by Duriez/Brunton/Noack that describes the development of complex interactions of stable and unstable waves as an approximation to turbulent fluid flow.

To avoid repeating the analysis of the last article, we simply present results here, noting that, again, the GP approach produced a better controller in less computational time than the SAC/neural network approach. The figures below show the oscillations of the uncontrolled system, the system under the GP controller, and the system under the SAC controller.

Uncontrolled dynamics, provided by author
GP controller results, provided by author
SAC controlled dynamics, provided by author

Both algorithms improve on the uncontrolled dynamics. We see that while the SAC controller acts more quickly (at about 20 time steps), it has lower accuracy. The GP controller takes a bit longer to act, but provides smooth behavior for both states. Also, as before, GP converged in fewer iterations than SAC.

We have seen that gymnasiums can easily be adapted to allow training RL algorithms on ODE systems, briefly discussed how powerful ODEs can be for describing (and hence exploring RL control of) physical dynamics, and seen the GP approach again producing the better outcome. However, we have not yet tried to optimize either algorithm, instead just setting up with, essentially, a guess at basic algorithm parameters. We will address that shortcoming now by expanding the MGM study.


Sagemaker Hyperparameter Tuning with Custom Models

As discussed previously, both GP and SAC have a set of hyperparameters that define the model. These parameters remain constant during model training but can be changed between runs to try to improve model performance (such as accuracy or convergence speed). As a quick review, the following table describes the hyperparameters used in the GP algorithm:

Ni, Ne, Nn, Pr, Pm, and Pc all affect exploration vs. exploitation, or how much time the algorithm spends trying to find new possible solutions versus refining the best solutions it already has. The number of batches trades increased computation time for increased accuracy and generalizability.

SAC as implemented here has the following hyperparameters:

To simplify coding and hyperparameter tuning, several ground rules have been imposed. Each hidden layer has the same number of neurons, and each neural network (actor and critic) has the same dimensions (other than the input and output layers) and the same batch/buffer size for updates. Also, each neural network uses the same activation functions and optimizer. These parameters, especially the neural network shape/dimensions, are valid hyperparameters but are omitted from tuning here to reduce code complexity and computation time.

The goal of tuning hyperparameters is to determine which ones will produce the most accurate model with the least computational cost. However, tuning hyperparameters requires training the model for each set of hyperparameters. Exploring the entire hyperparameter space, even for a modest number of hyperparameters, can lead to geometrically large test matrices if we wish to test a wide range of values for those parameters. The problem is complicated further because parameters can be coupled (i.e., the optimal value of one parameter may change depending on the setting of another). There are several ways to tune hyperparameters. A grid search tests every combination across an entire grid, requiring careful selection of which parameters and values to test. A random search tries random parameter combinations from a grid. Finally, some mathematical optimization approach could be used, such as Bayesian optimization or another ML algorithm. In any case, the best approach requires careful consideration (and maybe hyper-hyper-parameter optimization…)

AWS Sagemaker offers built in hyperparameter optimization for Sagemaker’s included or custom algorithms. Sagemaker’s tuning options are random, grid, Bayesian, or hyperband (which favors well performing sets of hyperparameters and can prematurely stop underperforming sets). To use Sagemaker’s Hyperparameter Tuning, we must provide the algorithms as Docker containers in Sagemaker, and pass the container image and training script into a hyperparameter tuning object.

As neither GP nor this specific SAC implementation uses an existing Sagemaker algorithm or framework (the SAC used here is based on Jax and Haiku, rather than TensorFlow, PyTorch, or MXNet), we will need to create custom RL frameworks. After exploring several tutorials and much trial and error, I was able to build properly working containers and training scripts for hyperparameter tuning. There were several tricky parts; for example, I found I had to zip my training file, upload it to S3, and then pass the path of the zip file in S3 in order to successfully use the hyperparameter argument of Sagemaker’s "estimator" ML object. The Dockerfile, container files, training scripts, and Jupyter notebooks used in Sagemaker are available on git for SAC and GP. Links to some of the sources used are available in the notebooks on Git.
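
For orientation, launching a tuning job against a custom container looks roughly like the sketch below. This is a hedged outline rather than the author’s notebook: the image URI, role, metric name and regex, and the hyperparameter names and ranges are illustrative placeholders.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

session = sagemaker.Session()

# custom RL container pushed to ECR; the training entrypoint is baked into the image
estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/gp-rl:latest",
    role="<sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="average_reward",
    metric_definitions=[{"Name": "average_reward",
                         "Regex": "average_reward=([-0-9.]+)"}],  # parsed from training logs
    hyperparameter_ranges={
        "population_size": IntegerParameter(100, 1000),       # example GP parameters
        "mutation_prob": ContinuousParameter(0.1, 0.9),
    },
    objective_type="Maximize",
    strategy="Bayesian",          # or "Random", "Grid", "Hyperband"
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit()   # no input data channels needed; the environment is simulated inside the container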

This approach could be refined; for example the app.py file probably doesn’t need to be in the container. Also, I put my custom ODE gymnasiums inside of the "Classical Control" gymnasium and loaded it locally to reduce the time spent building my own gymnasium from scratch.

Once the containers were working, I roughly followed an AWS blog to set up the hyperparameter tuning job. To make the hyperparameters work in the training scripts (app.py for GP, sacapp.py for SAC), I set up argparse for the parameters as guided by Sagemaker’s github examples. To limit the number of runs (and personal cost) of the tuning jobs, I selected a limited set of hyperparameters, focusing on exploring the concept and evaluating how much effect tuning would have.
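
Inside the training script, that argparse wiring can be as simple as the sketch below; the parameter names are illustrative rather than the exact ones in app.py or sacapp.py. Sagemaker passes each tuned hyperparameter to the script as a command-line argument.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--population_size", type=int, default=500)
parser.add_argument("--mutation_prob", type=float, default=0.25)
parser.add_argument("--n_batches", type=int, default=5)
args, _ = parser.parse_known_args()   # tolerate any extra arguments Sagemaker supplies

print(f"Training with {args.population_size=} {args.mutation_prob=} {args.n_batches=}")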

Running the hyperparameter tuning job was quick; results are given below:

Only Probability of Mutation (Pm) has an optimal value near the boundary of the range.

Sagemaker’s examples provide hyperparameter visualization scripts that allow us to review how the tuning jobs went. We review them for SAC below (results for GP hyperparameter tuning are omitted for brevity). First, we see an overview of the different tuning jobs (squares were stopped prematurely, circles completed) over time, plotted against the reward.

The visualizations also provide a breakdown of performance by parameter, giving insight into the impact of different parameters on algorithm performance. Below we look at the number of neurons per hidden layer and see a trend optimizing around 8.

We’ve only scratched the surface of ODEs and hyperparameters. In particular, the exploration of SAC tuning has been rudimentary; neural network design is a science (or perhaps an art) unto itself. However, hopefully this article has provided insight into, and a starting point for, applying and optimizing RL for physical dynamics!


[1] Manchester, Ian R., and Jean-Jacques E. Slotine. "Output-Feedback Control of Nonlinear Systems Using Control Contraction Metrics and Convex Optimization." 2014 4th Australian Control Conference (AUCC) (November 2014).

Reinforcement Learning for Physical Dynamical Systems: An Alternative Approach

Reintroducing genetic algorithms and comparing to neural networks

Physical and Nonlinear Dynamics

Control theory, through classical, robust, and optimal approaches, enables modern civilization. Refining, telecommunications, modern manufacturing, and more depend on it. Control theory has been built on the insight provided by physics equations, such as those derived from Newton’s Laws and Maxwell’s equations. These equations describe the dynamics, the interplay of different forces, acting on physical systems. Through them we understand how a system moves between states, where a state is "the set of all information that sufficiently describes the system" [1], often in terms of variables such as the pressure or velocity of fluid particles in fluid dynamics, or charge and current states in electrodynamics. By deriving equations for these systems, we can predict how the states change through time and space and express this evolution as a differential equation. With this understanding, we can apply controls in the form of specially applied forces to maintain these systems at a desired state or output. Typically, this force is calculated based on the output of the system. Consider a vehicle cruise control. The input is the desired speed, the output the actual speed. The system is the engine. The state estimator observes the speed, determines the difference between output and input speed, and decides how to apply a control, such as adjusting fuel flow, to reduce the error.

However, for all its accomplishments, control theory encounters substantial limitations. Most control theory is built around linear systems, or systems where a proportional change in input leads to a proportional change in output. While these systems can be quite complex, we have extensive understanding of these systems, affording us practical control of everything from deep ocean submersibles and mining equipment to spacecraft.

However, as Stanislaw Ulam remarked, "using a term like nonlinear science is like referring to the bulk of zoology as the study of non-elephant animals." Our progress so far in controlling complex physical systems has mostly come through finding ways to limit them to linear behavior. This can cost us efficiency in several ways:

· Breaking complex systems down into component parts that are individually controlled, optimizing for subsystems rather than for the system as a whole

· Operating systems in simpler but less efficient operating modes, or not taking advantage of complex physics, such as active flow control to reduce aircraft drag

· Imposing tight operating condition limits that can result in unpredictable or catastrophic failure if exceeded

Advanced manufacturing, improved aerodynamics, and complex telecommunications would all benefit from a better approach to control of nonlinear systems.

The fundamental characteristic of nonlinear dynamical systems is their complex response to inputs. Nonlinear systems vary dramatically even with small changes in environment or state. Consider the Navier-Stokes equations that govern fluid flow: the same set of equations describes a placid, slow flowing stream as a raging torrent, and all the eddies and features of the raging torrent are contained within the equation dynamics.

Nonlinear systems present difficulties: unlike linear systems, we often don’t have an easily predictable idea of how the system will behave as it transitions from one state to the next. The best we can do is approach this through general analysis or extensive simulation. Hence, with nonlinear systems we are faced with two problems: system identification, that is, understanding how the system will behave at a given state; and system control, how it will change in the short and long term in response to a given input, and hence what input to make to get the desired outcome.

Reinforcement Learning for Physics

While nonlinear analysis and control continue to make progress, we remain limited in our ability to exploit these systems using traditional, equation-based methods. However, as computing power and sensor technology become more accessible, data-based methods offer a different approach.

The massive increase in data availability has given rise to Machine Learning (ML) approaches, and reinforcement learning (RL) provides a new way to tackle the challenge of controlling nonlinear dynamical systems more effectively. RL, already finding success in environments from self-driving cars to strategy and computer games, is an ML framework which trains algorithms, or agents, "to learn how to make decisions under uncertainty to maximize a long-term benefit through trial and error" [1]. In other words, RL algorithms address the problems of system identification and control optimization, not by manipulation and analysis of governing equations, but by sampling the environment to develop a prediction of which input actions lead to desired outcomes. RL algorithms, or agents, apply a policy of actions based on the system state, and refine this policy as they analyze more information on the system.

Many RL algorithms are based on using neural networks to develop functions that map state to optimal behavior. RL problems can be framed as state-action-reward tuples: for a given state, a certain action leads to a given reward. Neural networks act as universal function approximators that can be tuned to accurately approximate the state-action-reward function across an entire system. To do so, the algorithm must acquire new knowledge by exploring the system or environment, and then refine its policy by exploiting the additional data gained. RL algorithms are differentiated by how they apply mathematics to explore, exploit, and balance between the two.

However, neural networks pose several challenges:

· Resource requirements. Using a neural network to estimate a function that can determine the reward and best action to take for every state can take considerable time and data.

· Explainability. It is often difficult to understand how neural networks arrive at their solutions, which limits their utility for providing real insight and can make it hard to predict or bound the actions of a neural network. Explainability is especially important for physical systems, as it would allow the powerful analytical tools developed over several centuries of mathematics to be used to gain additional insight into a system.

While there are approaches, such as transfer learning and topological analysis, to address these challenges, they remain barriers to fuller application of RL. However, an alternate approach may be useful in our case, where we are looking specifically at physical systems. Recall that the physical systems we are discussing are defined by, or can be very well described by, mathematical equations. Instead of having to develop a completely arbitrary function, we can focus on trying to find an expression composed of common mathematical operators: arithmetic, algebraic, and transcendental functions (sine, e^x, etc.). Our means to this end will be genetic algorithms. As described in [2], genetic algorithms can be adapted to explore function spaces through random generation of functions, and to exploit and refine solutions through mutation and cross-breeding of promising candidates.

So, while neural networks are the champions of most RL problems, for physical dynamics a new challenger appears. Next we will take a closer look at the genetic algorithm approach and see how it fares against a leading RL algorithm, Soft Actor Critic. To do this we will evaluate both in physics-based gymnasiums using AWS Sagemaker Experiments. We will conclude by evaluating the results, discussing conclusions, and suggesting next steps.

Recall that RL faces two challenges: exploring the environment and exploiting the information discovered. Exploration is necessary to find the best policy considering the likelihood of being in any state. Failure to explore means both that a global optimum may be missed in favor of a local one, and that the algorithm may not generalize sufficiently to succeed in all states. Exploitation is needed to refine the current solution to an optimum. However, as an algorithm refines a particular solution, it trades away the ability to explore the system further.

Soft Actor Critic (SAC) is a refinement of the powerful Actor-Critic RL approach. The Actor-Critic family of algorithms approaches the explore/exploit trade-off by separating estimation of the state values and associated rewards from optimization of a particular policy of inputs. As the algorithm collects new information, it updates each estimator. Actor-Critic has many nuances to its implementation; interested readers should consult books or online tutorials. SAC optimizes the critic by favoring exploration of states whose rewards are dramatically different from what the critic estimated. OpenAI provides a detailed description of SAC.

For this experiment, we use the Coax implementation of SAC. I looked at several RL libraries, including Coach and Spinning Up, but Coax was one of the few I found to work mostly "out of the box" with current Python builds. The Coax library includes a wide range of RL algorithms, including PPO, TD3, and DDPG and works well with gymnasium.

Actor-critic methods such as SAC are typically implemented with neural networks as the function approximator. As we discussed last time, there is another potential approach to exploring the system and exploiting potential control policies. Genetic algorithms explore through random generation of possible solutions and exploit promising policies by mutating or combining elements (breeding) of different solutions. In this case, we will evaluate a genetic programming variant of genetic algorithms as an alternative means of function approximation; specifically, we will use a genetic approach to randomly generate and then evaluate trees of functions containing constants, state variables, and mathematical functions as potential controllers.

The Genetic Programming (GP) algorithm implemented here is adapted from [2], except that in place of the tournament used by that text, this implementation selects the top 64% (Nn below of 33%) of each generation as eligible for mutation and reseeds the remainder for better exploration of the solution space. To create each individual tree in a generation, a growth function randomly calls from arithmetic functions (+, -, *, /) and transcendental functions (such as e^x, cos(x)) to build branches, with constants or state variables as the leaves that end branches. Recursive calls are used to build expressions based on Polish notation ([2] implemented this via LISP; I have adapted it to Python), with rules in place to avoid, e.g., division by 0 and to ensure mathematical consistency so that every branch ends correctly in a constant or sensor-value leaf. Conceptually, an equation tree appears as:

Fig 1. Example function tree, provided by author based on [2]

This results in a controller b = sin(s1) + e^(s1*s2/3.23) - 0.12, written by the script as: - + sin s1 e^ / s1 s2 3.23 0.12, where s denotes state variables. It may seem confusing at first, but writing out a few examples will clarify the approach.
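
To make the representation concrete, here is a small evaluator of my own for such prefix (Polish notation) token lists; it is an illustration, not the author’s implementation, and the divide-by-zero guard is an assumed convention.

import math

BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
          "*": lambda a, b: a * b,
          "/": lambda a, b: a / b if abs(b) > 1e-6 else a}   # assumed divide-by-zero rule
UNARY = {"sin": math.sin, "cos": math.cos, "e^": math.exp}

def evaluate(tokens, state):
    """Recursively evaluate a prefix token list against a dict of state variable values."""
    token = tokens.pop(0)
    if token in BINARY:
        return BINARY[token](evaluate(tokens, state), evaluate(tokens, state))
    if token in UNARY:
        return UNARY[token](evaluate(tokens, state))
    if token in state:          # state variable leaf, e.g. "s1"
        return state[token]
    return float(token)         # constant leaf

# Example: sin(s1) + e^(s1/3.23) - 0.12 written in prefix form
expr = "- + sin s1 e^ / s1 3.23 0.12".split()
print(evaluate(expr, {"s1": 0.5, "s2": 1.0}))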

With a full generation of trees built, each one is run through the environment to evaluate performance. The trees are then ranked for control performance based on achieved reward. If the desired performance is not met, the best performing tree is preserved, and the top 66% are mutated by crossover (swapping elements of two trees), cut-and-grow (replacing an element of a tree), shrink (replacing a tree element with a constant), or re-parameterization (replacing all constants in a tree), following [2]. This allows exploitation of the most promising solutions. To continue to explore the solution space, the low-performing solutions are replaced with random new trees. Each successive generation is then a mix of random new individuals and replications or mutations of the top performing solutions.

Trees are tested against random start locations within the environment. To prevent a "lucky" starting state from skewing results (analogous to overfitting the model), trees are tested against a batch of different random starting states.

Hyperparameters for genetic programming include:

Table 1. Hyperparameters for the Genetic Programming Algorithm

Commented code can be found on github. Note that I am a hobby coder, and my code is kludgy. Hopefully it is at least readable enough to understand my approach, despite any un-pythonic or generally bad coding practice.

Evaluating the Approaches

Both algorithms were evaluated in two different gymnasium environments. The first is the simple pendulum environment provided by the Gymnasium Foundation. The inverted pendulum is a simple nonlinear dynamics problem. The action space is a continuous torque that can be applied to the pendulum. The observation space is the same as the state: the x, y coordinates and the angular velocity. The goal is to hold the pendulum upright. The second is the same gymnasium, but with random noise added to the observation. The noise is normal with mean 0 and variance 0.1, to simulate realistic sensor measurements.
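
As a sketch of that second setup (my own illustration, not the author’s code), the noise can be added with a gymnasium ObservationWrapper; since the article specifies a variance of 0.1, the standard deviation passed to the normal distribution is sqrt(0.1).

import numpy as np
import gymnasium as gym

class NoisyObservation(gym.ObservationWrapper):
    """Add zero-mean Gaussian noise to every observation."""

    def __init__(self, env, variance=0.1):
        super().__init__(env)
        self.std = np.sqrt(variance)

    def observation(self, obs):
        return obs + np.random.normal(0.0, self.std, size=obs.shape)

env = NoisyObservation(gym.make("Pendulum-v1"))
obs, info = env.reset(seed=0)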

One of the most important parts of RL development is designing a proper reward function. While there are many algorithms that can solve a given RL problem, defining an appropriate reward for those algorithms to optimize toward is a key step in making a given algorithm successful for a specific problem. Our reward needs to allow us to compare the results of two different RL approaches while ensuring each proceeds toward its goal. Here, for each trajectory we track cumulative reward and average reward. To make this easier, we have each environment run for a fixed number of time steps, with a negative reward based on how far from the target state the agent is at each time step. The Pendulum gym operates this way out of the box: truncation at 200 time steps and a negative reward, depending on how far from upright the pendulum is, with a maximum reward of 0, enforced at every time step. We will use average reward to compare the two approaches.

Our goal is to evaluate the convergence speed of each RL framework. We will accomplish this using AWS Sagemaker Experiments, which can automatically track metrics (such as current reward) and parameters (such as active hyperparameters) across runs by iteration or CPU time. While this monitoring could be accomplished through Python tools, Experiments offers streamlined tracking and indexing of run parameters and performance, and replication of compute resources. To set up the experiment, I adapted the examples provided by AWS. The SAC and GP algorithms were first assessed in local Jupyter notebooks and then uploaded to a git repository. Each algorithm has its own repository and Sagemaker notebook. The run parameters are stored to help classify the run and track the performance of different experiment setups. Run metrics, in our case reward and state vector, are the dependent variables we want to measure to compare the two algorithms. Experiments automatically records CPU time and iteration as independent variables.
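
For reference, metric and parameter logging with Sagemaker Experiments looks roughly like the sketch below; it is a hedged outline using the sagemaker.experiments.Run API, and the experiment name, parameter names, and the stand-in training loop are all illustrative.

from sagemaker.experiments import Run

def train_iterations():
    """Hypothetical stand-in for the RL training loop, yielding average reward per iteration."""
    for i in range(10):
        yield -1000.0 / (i + 1)

with Run(experiment_name="gp-vs-sac-pendulum", run_name="gp-trial-1") as run:
    run.log_parameters({"population_size": 500, "mutation_prob": 0.25})   # run parameters
    for iteration, avg_reward in enumerate(train_iterations()):
        run.log_metric(name="average_reward", value=avg_reward, step=iteration)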

Through these experiments we can compare the performance of the champion, a well-developed, mature RL algorithm like SAC, against the contender, a little-known approach coded by a hobby coder without formal RL or python training. This experiment will provide insight into different approaches to developing controllers for complex, non-linear systems. In the next part we will review and discuss results and potential follow-ons.

The first experiment was the default pendulum gymnasium, where the algorithm tries to determine the correct torque to apply to keep the pendulum inverted. It ends after a fixed time and gives a negative reward based on how far from vertical the pendulum is. Prior to running in a Sagemaker experiment, both the SAC and GP algorithms were run on my local machine to verify convergence. Running in Experiments allowed better tracking of comparable compute time. Results of compute time against average reward per iteration follow:

Results for GP and SAC, provided by author

We see that GP, despite being a less mature algorithm, arrived at a solution with far less computational requirement than SAC. On the local run to completion, SAC seemed to take about 400,000 iterations to converge, requiring several hours. The local instantiation was programmed to store recordings of SAC’s progress throughout training; interestingly, SAC seemed to move from learning how to swing the pendulum towards the top, to learning how to hold the pendulum still, and then to combining these, which would explain the dip in reward during the period when SAC was learning to hold the pendulum steady. With GP we see a monotonic, stepwise increase in reward. This is because the best performing function tree is always retained, so the best reward stays steady until a better controller is found.

The second experiment added Gaussian noise (0, 0.1) to the state measurement. We see similar results as in the no-noise case, but with longer convergence times. Results are shown below; again, GP outperforms SAC.

Results for GP and SAC with noisy observations, provided by author

In both cases we see GP perform faster than SAC (as with the previous example, SAC did converge locally; I just didn’t want to pay AWS for the compute time!). However, as many of you have no doubt noticed, this has been a very basic comparison, both in terms of machine learning and of physical systems. For example, hyperparameter tuning could yield different results. Still, this is a promising start for the contender algorithm and shows it to be worth further investigation.

In the long run, I think GP may offer several benefits over neural network-based approaches like SAC:

· Explainability. While the equation GP finds can be convoluted, it is transparent. Skilled practitioners may simplify the equation, helping provide insight into the physics of the determined solution, which is helpful for determining regions of applicability and increasing trust in the controller. Explainability, while an active area of research, remains a challenge for neural networks.

· Informed ML. GP allows easier application of insight into the system under analysis. For example, if the system is known to have sinusoidal behavior, the GP algorithm can be adapted to try more sinusoidal solutions. Alternatively, if a solution is known for a similar or simplified system to the one under study, then that solution can be pre-seeded into the algorithm.

· Stability. With the addition of simple safeguards for mathematical validity and limits on absolute value, GP approaches remain stable. As long as the top performer is retained each generation, the solution will converge, though time bounds on convergence are not guaranteed. The neural network approaches of more common RL do not have such guarantees.

· Developmental opportunity. GP is relatively immature. The SAC implementation here was one of several available for application, and neural networks have benefited from extensive effort to improve performance. GP hasn’t benefited from such optimization; my implementation was built around function rather than efficiency. Despite this, it performed well against SAC, and further improvements from more professional developers could provide large gains in efficiency.

· Parallelizability and modularity. Individual GP equations are simple compared to NNs; the computational cost comes from repeated runs through the environment rather than from the environment runs plus backpropagation of NNs. It would be easy to split a "forest" of different GP equation trees across different processors to greatly improve computing speed.

However, neural network approaches are used more extensively for good reason:

· Scope. Neural networks are universal function approximators. GP is limited to the terms defined in the function tree. Hence, neural network based approaches can cover a far greater range and complexity of situations. I would not want to try GP to play Starcraft or drive a car.

· Tracking. GP is a refined version of random search, which can result, as seen in the experiments, in halting, stepwise improvement.

· Maturity. Because of the extensive work across many different neural-network-based algorithms, it is easier to find an existing one that is optimized for computational efficiency and can be quickly applied to a problem.

From a machine learning perspective, we have only scratched the surface of what we can do with these algorithms. Some follow-ons to be considered include:

· Hyperparameter tuning.

· Controller simplicity, such as penalizing reward for number of terms in control input for GP.

· Controller efficiency, such as detracting size of control input from reward.

· GP monitoring and algorithm improvement as described above.

From a physics perspective, this experiment serves as a launching point into more realistic scenarios. More complex scenarios will likely show NN approaches catch up to or surpass GP. Possible follow-ons include:

· More complex dynamics such as Van Der Pol equations or higher dimensionality.

· Limited observability instead of full state observability.

· Partial Differential Equation systems and optimizing controller location as well as input.

[1] E. Bilgin, Mastering Reinforcement Learning with Python: Build next-generation, self-learning models using reinforcement learning techniques and best practices (2020), Packt Publishing

[2] T. Duriez, S. L. Brunton, B. R. Noack, Machine Learning Control – Taming Nonlinear Dynamics and Turbulence (2017), Springer International Publishing
