Online Linear Quadratic Tracking with Regret Guarantees
Abstract
Online learning algorithms for dynamical systems provide finite time guarantees for control in the presence of sequentially revealed cost functions. We pose the classical linear quadratic tracking problem in the framework of online optimization where the time-varying reference state is unknown a priori and is revealed after the applied control input. We show the equivalence of this problem to the control of linear systems subject to adversarial disturbances and propose a novel online gradient descent-based algorithm to achieve efficient tracking in finite time. We provide a dynamic regret upper bound scaling linearly with the path length of the reference trajectory and a numerical example to corroborate the theoretical guarantees.
Keywords: Optimal Tracking, Online Control.
1 Introduction
Linear quadratic tracking (LQT) is the natural generalization of the optimal linear quadratic regulator (LQR) to the setting where the goal is to drive the state not to the origin but to a given reference. The reference trajectory need not be time-invariant and, in the classical formulation of the problem, is known in advance. This is a reasonable assumption in many practical applications, such as aircraft tracking of a predetermined trajectory or precision control in industrial process engineering. However, in other scenarios, for example when tracking the output of a secondary agent whose dynamics are unknown and/or whose measurements are imperfect, predicting the next reference point is non-trivial. In these cases the reference trajectory is only revealed sequentially, after the action has been taken, suggesting the need for an online or adaptive algorithm that learns or adapts to the dynamics of the reference-generating agent.
In this letter, we study the LQT problem with an unknown reference trajectory. We pose the problem in the framework of online convex optimization (OCO) subject to the dynamics constraint of the system [1, 2]. In particular, the tracking problem is recast into an equivalent regulation problem with a redefined state that evolves with linear dynamics subject to additive adversarial disturbances. In the spirit of online decision-making under computational and memory constraints, our goal is to develop a gradient-based algorithm that is fast, simple to implement and does not require large memory. To this end, we show how classical online gradient descent (OGD) may fail to achieve optimal tracking and propose a modified algorithm, called SS-OGD (steady state OGD), that is guaranteed to achieve the goal under mild conditions. Given the online nature of the algorithm, its performance is quantified by means of dynamic regret, which compares the accumulated finite-time cost of a given algorithm to that of an optimal benchmark that solves the LQT problem with a priori knowledge of the reference trajectory. We provide a dynamic regret bound that scales linearly with the path length of the reference trajectory.
The LQT problem for sequentially revealed adversarial reference states is studied mostly with policy regret guarantees, with one of the first works [3] suggesting a relatively computationally heavy algorithm. In a more recent line of work, the authors of [4] introduce a memory-based gradient descent algorithm and in [5] tackle the constrained tracking problem. Several works also provide dynamic regret guarantees for tracking of unknown targets; however, their settings differ from ours. In [6], the authors analyze an output tracking scheme but assume an iterative setting, while in [7] a window of predictions is available. Without predictions, their regret order is determined by that of a fixed oracle controller. In [7], the authors also provide a lower bound for the dynamic regret in terms of the reference path length, matching the order of our proposed scheme. Gradient-based algorithms, such as the ones we study, have also been developed in the context of online feedback optimization [8]. There, in contrast to our setting, the dynamics are generally assumed to be unknown but, crucially, the cost functions are fixed over the horizon, and the regret is not analyzed. Several recent works, e.g. [9], consider a similar setting with time-varying costs. These costs, however, are allowed to be estimated offline by training, which is incompatible with our setting, and no regret guarantees are provided.
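Notation: The set of positive real numbers is denoted by $\mathbb{R}_+$ and that of non-negative integers by $\mathbb{N}$. For a matrix $W$ the spectral radius and the spectral norm are denoted by $\rho(W)$ and $\|W\|$, respectively, and $\lambda_{\min}(W)$ denotes its minimum eigenvalue. We define $\lambda_W := \frac{1 + \rho(W)}{2}$; one can show that if $\rho(W) < 1$, there exists a $c_W \in \mathbb{R}_+$ such that $\|W^k\| \le c_W \lambda_W^k$ for all $k \in \mathbb{N}$. For a given vector $x$, its Euclidean norm is denoted by $\|x\|$, and the one weighted by some matrix $Q$ by $\|x\|_Q := \sqrt{x^\top Q x}$.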
2 Problem Statement
Consider the discrete-time linear time-invariant (LTI) dynamical system, given by
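$$x_{t+1} = A x_t + B u_t, \tag{1}$$
where $x_t \in \mathbb{R}^n$ and $u_t \in \mathbb{R}^m$ are the state and input vectors respectively, and $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$ are known system matrices. The goal of the optimal LQT problem is the tracking of a time-varying signal $r_t \in \mathbb{R}^n$, such that the cost
$$\|x_T - r_T\|_P^2 + \sum_{t=0}^{T-1} \left( \|x_t - r_t\|_Q^2 + \|u_t\|_R^2 \right)$$
is minimized for some weighting matrices $Q$ and $R$, and where $P$ is the solution of the discrete algebraic Riccati equation (DARE)
$$P = Q + A^\top P A - A^\top P B (R + B^\top P B)^{-1} B^\top P A. \tag{2}$$
Footnote 1: The final cost matrix is taken to be $P$ for simplicity. For other values of the terminal cost matrix the results still hold up to an additional constant [2].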
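As a small illustration of the DARE, the following sketch solves the scalar case by fixed-point (Riccati) iteration and forms the corresponding LQR gain. The numerical values and the function name `solve_scalar_dare` are hypothetical, and the gain is written in the common convention $K = (R + B^\top P B)^{-1} B^\top P A$, which is an assumption rather than something stated in the text.

```python
def solve_scalar_dare(a, b, q, r, iters=1000):
    """Fixed-point iteration p <- q + a^2 p - (a p b)^2 / (r + b^2 p)."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * p * b) ** 2 / (r + b * b * p)
    return p

# Hypothetical scalar system and weights.
a, b, q, r = 0.9, 1.0, 1.0, 1.0
p = solve_scalar_dare(a, b, q, r)

# Residual of the scalar DARE at the computed p (should be ~0).
residual = q + a * a * p - (a * p * b) ** 2 / (r + b * b * p) - p

# LQR gain (assumed convention u = -k e) and closed-loop pole.
k = b * p * a / (r + b * b * p)
closed_loop = a - b * k

print(p, k, closed_loop, residual)
```

For matrix-valued problems one would normally rely on a dedicated Riccati solver instead of this iteration; the scalar version is only meant to make the fixed-point structure of (2) visible.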
The LQT problem can be recast into an equivalent LQR formulation [10] by considering instead the dynamics
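$$e_{t+1} = A e_t + B u_t + w_t, \tag{3}$$
with $e_t := x_t - r_t$ and $w_t := A r_t - r_{t+1}$ for all $t$, and the corresponding cost function
$$J_T(e_0, u) = \|e_T\|_P^2 + \sum_{t=0}^{T-1} \left( \|e_t\|_Q^2 + \|u_t\|_R^2 \right). \tag{4}$$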
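The recasting of tracking as regulation can be checked numerically. The sketch below simulates the original dynamics alongside the error dynamics driven by the artificial disturbance $w_t = A r_t - r_{t+1}$, and records the largest gap between $e_t$ and $x_t - r_t$. The system matrices, reference and inputs are hypothetical values chosen only for illustration.

```python
def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

def sub(u, v):
    return [ui - vi for ui, vi in zip(u, v)]

# Hypothetical double-integrator-like system.
A = [[1.0, 0.1], [0.0, 1.0]]
B = [[0.0], [0.1]]
T = 20
refs = [[0.05 * t, 0.5] for t in range(T + 1)]   # reference r_0, ..., r_T
inputs = [[(-1.0) ** t] for t in range(T)]        # arbitrary input sequence

x = [1.0, -1.0]
e = sub(x, refs[0])                               # e_0 = x_0 - r_0
max_gap = 0.0
for t in range(T):
    w = sub(matvec(A, refs[t]), refs[t + 1])      # w_t = A r_t - r_{t+1}
    x = add(matvec(A, x), matvec(B, inputs[t]))   # original dynamics
    e = add(add(matvec(A, e), matvec(B, inputs[t])), w)  # error dynamics
    gap = max(abs(ei - (xi - ri)) for ei, xi, ri in zip(e, x, refs[t + 1]))
    max_gap = max(max_gap, gap)

print(max_gap)  # agreement up to floating-point rounding
```

The same input sequence is applied to both recursions, so any mismatch would indicate an error in the definition of $w_t$.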
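When the reference trajectory $r_t$, $t = 0, \dots, T$, is known at the initial time, a closed-form solution for the optimal controller that solves the following optimization problem can be obtained:
$$\min_{u_0, \dots, u_{T-1}} \; J_T(e_0, u) \tag{5a}$$
$$\text{s.t.} \quad e_{t+1} = A e_t + B u_t + w_t, \quad t = 0, \dots, T-1, \tag{5b}$$
where $J_T$ is the cost (4).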
This controller, often referred to as the optimal offline noncausal controller, can be represented as a linear feedback on the current state and the future reference [11, 12].
Departing from the classical formulation of tracking control, we assume that the reference signal is unknown and is only revealed sequentially after the control input has been applied, similar to the adversarial tracking framework in [3]. In particular, for each time step $t$:
1: The state $x_t$ and the reference state $r_t$ are observed,
2: The agent decides on an input $u_t$,
3: The environment decides on the next reference $r_{t+1}$, which, in turn, determines $w_t$. The error state then evolves according to (3), incurring the following cost for the agent
$$c_t(e_t, u_t) = \|A e_t + B u_t + w_t\|_Q^2 + \|u_t\|_R^2. \tag{6}$$
Note that the online cost (6) depends on the current input $u_t$ and the unknown disturbance $w_t$, and is therefore unknown to the decision maker at timestep $t$; it is revealed only at time $t+1$, after the input has been applied to the system. Our problem formulation fits the online learning framework, with the extra challenge of inherent dynamics. The goal of the controller is then to minimize the online cumulative cost
$$\sum_{t=0}^{T-1} c_t(e_t, u_t),$$
where $c_t$ is the online cost (6). This is the same as the LQR cost without the initial state, implying that the minimizers of both problems coincide.
Footnote 2: For consistency with (4), the final error $e_T$ is weighted by $P$ in the last stage cost. To avoid cluttering the notation, the separate treatment of the last timestep is left implicit.
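The order in which information becomes available can be made explicit with a short simulation loop. This is a minimal scalar sketch, assuming the stage cost takes the form $q\, e_{t+1}^2 + r\, u_t^2$ and ignoring the special weighting of the last timestep; the function `run_online`, the example policy and all numbers are hypothetical.

```python
def run_online(policy, a, b, q_w, r_w, x0, references):
    """Play the online tracking protocol and return the cumulative cost."""
    x, total = x0, 0.0
    for t in range(len(references) - 1):
        e = x - references[t]                 # step 1: observe x_t and r_t
        u = policy(e)                         # step 2: commit to u_t
        r_next = references[t + 1]            # step 3: r_{t+1} is revealed
        w = a * references[t] - r_next        # artificial disturbance w_t
        e_next = a * e + b * u + w            # error dynamics
        total += q_w * e_next ** 2 + r_w * u ** 2   # cost revealed after acting
        x = e_next + r_next                   # recover the next state
    return total

# Hypothetical integrator, a proportional policy and a step reference.
total = run_online(lambda e: -0.5 * e, a=1.0, b=1.0, q_w=1.0, r_w=1.0,
                   x0=0.0, references=[0.0, 1.0, 1.0])
print(total)
```

The policy only ever receives quantities available at time $t$; the next reference enters the loop strictly after the input has been chosen.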
We quantify the finite-time performance of the algorithm through the means of dynamic regret. Consider a policy , mapping from the available information set, , to the control input space. Its dynamic regret, given a disturbance signal , is defined as
(7) |
where is the input generated by and is given by (5).
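To make the benchmark concrete, the sketch below computes the dynamic regret of a plain LQR feedback policy for a scalar system. It assumes the regret is the difference in the cost (4) between the policy and the offline optimum, and it obtains the noncausal optimum by backward dynamic programming with a cost-to-go of the form $p e^2 + 2 v_t e + \text{const}$. All names and numbers are hypothetical.

```python
def scalar_dare(a, b, q, r, iters=1000):
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * p * b) ** 2 / (r + b * b * p)
    return p

def rollout_cost(a, b, q, r, p, e0, w, inputs):
    """Cost of the form (4): stage cost q e^2 + r u^2, terminal weight p."""
    e, cost = e0, 0.0
    for u, wt in zip(inputs, w):
        cost += q * e * e + r * u * u
        e = a * e + b * u + wt
    return cost + p * e * e

def offline_optimal_inputs(a, b, r, p, e0, w):
    """Noncausal optimum: feedback on the state plus a term built from future w."""
    T = len(w)
    d = r + p * b * b
    v = [0.0] * (T + 1)                       # linear cost-to-go coefficients
    for t in range(T - 1, -1, -1):
        v[t] = (r / d) * a * (p * w[t] + v[t + 1])
    e, inputs = e0, []
    for t in range(T):
        u = -b * (p * (a * e + w[t]) + v[t + 1]) / d
        inputs.append(u)
        e = a * e + b * u + w[t]
    return inputs

def lqr_inputs(a, b, k, e0, w):
    """Causal policy u_t = -k e_t that ignores the disturbance."""
    e, inputs = e0, []
    for wt in w:
        u = -k * e
        inputs.append(u)
        e = a * e + b * u + wt
    return inputs

a, b, q, r, e0 = 0.9, 1.0, 1.0, 1.0, 1.0
p = scalar_dare(a, b, q, r)
k = a * b * p / (r + b * b * p)
w = [0.3, -0.2, 0.5, 0.0, -0.4, 0.1]          # hypothetical disturbance sequence

u_star = offline_optimal_inputs(a, b, r, p, e0, w)
u_lqr = lqr_inputs(a, b, k, e0, w)
regret = (rollout_cost(a, b, q, r, p, e0, w, u_lqr)
          - rollout_cost(a, b, q, r, p, e0, w, u_star))
print(regret)
```

With zero disturbance the two input sequences coincide and the regret vanishes; any nonzero disturbance leaves the causal policy strictly worse than the benchmark in this example.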
We allow the trajectory to be arbitrary, as long as it remains bounded.
Assumption 1 (Bounded trajectory)
There exists an $\bar{r} \in \mathbb{R}_+$ such that $\|r_t\| \le \bar{r}$ for all $t$.
The more abruptly a trajectory changes, the harder it is to achieve good tracking performance, especially if the trajectory is unknown beforehand. To capture this inherent complexity of the problem with dynamic regret, we use the well-established notion of path length [7], [13].
Definition 2.1 (Path Length)
The path length of a reference trajectory $r_0, \dots, r_T$ is $\sum_{t=1}^{T} \|\Delta r_t\|$, where $\Delta r_t := r_t - r_{t-1}$.
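As a quick illustration, the path length can be computed as the sum of Euclidean distances between consecutive reference points. The helper name `path_length` and the example trajectories are hypothetical, and the indexing convention (summing over all consecutive pairs) is an assumption.

```python
import math

def path_length(references):
    """Sum of Euclidean distances between consecutive reference points."""
    return sum(math.dist(r_next, r_prev)
               for r_prev, r_next in zip(references, references[1:]))

constant = [(1.0, 2.0)] * 5                     # a constant reference
ramp = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]     # two steps of length 5

print(path_length(constant), path_length(ramp))
```

A constant reference has zero path length, while a reference that keeps moving accumulates path length in proportion to the size of its steps.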
For more random and abrupt changes in the trajectory, the path length is higher, and one expects the performance of an online algorithm to deteriorate. Likewise, an efficient algorithm should improve as the path length decreases. This is captured quantitatively by showing at least a linear dependence of the algorithm’s regret on the path length. One can instead choose the complexity term to be the path length of the artificial disturbances $w_t$. The resulting path length will then decrease the closer the reference dynamics are to the given system in a certain operator norm, but will scale linearly with time in the case of a constant mismatch between the two. Since we only assume bounded references, we allow for potentially random references without any underlying dynamics. Hence, we choose the path length of the reference as our complexity term. Under the following standard assumptions on stabilisability and detectability [14], the LQR problem is well-posed.
Assumption 2 (LQR is well-posed)
The system $(A, B)$ is stabilisable, the pair $(A, Q^{1/2})$ is detectable and $R \succ 0$.
3 The SS-OGD Algorithm
We consider a control law of the following form
(8) |
where is fixed to the optimal LQR gain, and is a correction term that should account for the unknown disturbances; we will employ online learning techniques to update the latter term.
We investigate the performance of online gradient descent based algorithms. Consider the following “naive” update
(9) |
where is updated in the opposite direction of the gradient of the most recent cost. Here is the step size and the recursion starts from some . As the online objective is quadratic, the gradient is available in a closed form and the update can be represented as . For the case of a constant reference signal and an underactuated system, the algorithm can converge to a point that is not necessarily the optimal one with respect to infinite horizon cost minimization. This is due to the greedy behavior of the update that does not take into account future dynamics. In this section, we propose a simple modification to this myopic OGD update (9), called SS-OGD that accounts for this shortcoming.
To motivate the SS-OGD update, we consider the steady state solution of (3) in closed-loop with the affine control law (8) when the correction term and the disturbance are held fixed for all subsequent timesteps. Defining , a closed form solution for the steady state and input is given by
Footnote 3: Note that the steady state and the steady state input are both defined for a given correction term and disturbance. The dependence is omitted for simplicity.
(10) |
One can then find the which will recover the optimal steady state solution by minimizing the time-averaged infinite horizon steady state cost. For and defined as in (10), this is equivalent to minimizing
(11) |
whose gradient is given by
(12) |
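The steady state construction can be illustrated in the scalar case. The sketch below assumes the sign convention $u = -k e + \text{bias}$ for the affine control law and a steady state cost of the form $q\, e_{ss}^2 + r\, u_{ss}^2$; both are assumptions made for illustration. It checks the closed-form steady state against a simulation and the analytic gradient against a finite difference. The gain `k` is an arbitrary stabilizing value, not necessarily the LQR gain.

```python
def steady_state(a, b, k, bias, w):
    """Steady state of e+ = a e + b u + w under u = -k e + bias (assumed form)."""
    s = a - b * k                          # closed-loop pole
    e_ss = (b * bias + w) / (1.0 - s)
    u_ss = -k * e_ss + bias
    return e_ss, u_ss

def ss_cost(a, b, k, q, r, bias, w):
    e_ss, u_ss = steady_state(a, b, k, bias, w)
    return q * e_ss ** 2 + r * u_ss ** 2

def ss_gradient(a, b, k, q, r, bias, w):
    """Chain rule through the steady state map."""
    s = a - b * k
    g_e = b / (1.0 - s)                    # d e_ss / d bias
    g_u = 1.0 - k * g_e                    # d u_ss / d bias
    e_ss, u_ss = steady_state(a, b, k, bias, w)
    return 2.0 * (g_e * q * e_ss + g_u * r * u_ss)

# Hypothetical values.
a, b, k, q, r, bias, w = 0.9, 1.0, 0.5, 1.0, 1.0, 0.3, -0.2

# Simulate the closed loop with fixed bias and disturbance.
e_sim = 0.0
for _ in range(200):
    e_sim = a * e_sim + b * (-k * e_sim + bias) + w
e_ss, u_ss = steady_state(a, b, k, bias, w)

# Central finite difference of the steady state cost.
h = 1e-6
fd = (ss_cost(a, b, k, q, r, bias + h, w)
      - ss_cost(a, b, k, q, r, bias - h, w)) / (2.0 * h)
grad = ss_gradient(a, b, k, q, r, bias, w)
print(e_sim, e_ss, grad, fd)
```

The two sensitivities `g_e` and `g_u` are the scalar counterparts of the matrices that map a change in the correction term to a change in the steady state and in the steady state input.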
Since is, in general not constant, and the steady state condition is not satisfied , we suggest a new OGD-like update rule on the bias term that is a modified version of the gradient in (12). Specifically, the feedback on the steady state error, , is replaced with the measured error, , and the steady state input, , with the latest applied input, . This results in the following update, named SS-OGD
(13) |
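The following scalar simulation sketches the idea behind the update for a constant reference. It assumes the control law $u_t = -k e_t + b_t$, and an update in which the steady state error and input in the steady state gradient are replaced by the measured error $e_t$ and the applied input $u_t$, with the factor 2 absorbed into the step size. The exact form and indexing of the update in (13) may differ, and all numbers are hypothetical.

```python
def scalar_dare(a, b, q, r, iters=1000):
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * p * b) ** 2 / (r + b * b * p)
    return p

a, b, q, r = 0.9, 1.0, 1.0, 1.0
p = scalar_dare(a, b, q, r)
k = a * b * p / (r + b * b * p)        # LQR gain (assumed convention)
s = a - b * k                          # closed-loop pole
g_e = b / (1.0 - s)                    # sensitivity of steady state error to bias
g_u = 1.0 - k * g_e                    # sensitivity of steady state input to bias

alpha = 0.1                            # hypothetical step size
ref = 2.0                              # constant reference
w = a * ref - ref                      # constant artificial disturbance

e, bias = 1.0, 0.0
for _ in range(500):
    u = -k * e + bias                  # affine control law
    step = g_e * q * e + g_u * r * u   # gradient-like term on measured signals
    e = a * e + b * u + w              # error dynamics
    bias = bias - alpha * step         # SS-OGD-style update of the bias

# Optimal steady state error: minimize q e^2 + r u^2 s.t. e = a e + b u + w.
e_star = r * w * (1.0 - a) / (q * b * b + r * (1.0 - a) ** 2)
print(e, e_star)
```

For these values the combined error and bias recursion is stable, and the error settles at the minimizer of the steady state cost rather than at the point a purely myopic update would select.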
The cost in (11) is defined for the steady state and input and is thus decoupled from the true online cost in (6) that reflects the current $e_t$ and $u_t$. These are, in general, not at a steady state, and $w_t$ is not constant. Thus, (11) is only an auxiliary, hallucinated cost used to construct the update (13).
Proof 3.2.
If the matrix is singular, there exists a , such that . Then, for , at steady state . Given the detectability condition of the pair , for any unstable, or marginally stable mode of , the matrix . This ensures that the matrix is positive definite, which is equivalent to the strong convexity of (11) .
The modifications from the standard OGD can be interpreted as incorporating the dynamics information in the update rule. As we show in the following, this ensures that in the limit, if the algorithm is stable and the reference signal is constant, the SS-OGD converges to the same point as the solution of the LQR problem minimizing (4). Moreover, through the feedback on the state and input , the update rule (13) incorporates a proportional integral (PI) control on the measured state. This is demonstrated on a quadrotor control example in Section 5, where, with the inherent integrator dynamics of the quadrotor, the SS-OGD achieves a zero steady state error in tracking a position reference signal with a constant rate of change.
To study the SS-OGD update rule, we introduce the following evolution of the combined system optimizer dynamics
(14) |
where , the matrices and are defined in Appendix 7 and .
Assumption 3
The step size is such that .
Since all the variables in are known a priori, we show that there always exists an satisfying this assumption and provide a sufficient condition in Appendix 7.
The following theorem shows that, for a constant for all , SS-OGD update (13) converges to the solution of
(15) |
with . The solution of (15) can be interpreted as the steady state and steady state input that minimize the infinite horizon time-averaged cost (4).
Theorem 3.3.
4 Regret Analysis
To characterize the effectiveness of the algorithm for time-varying signals and to provide finite time guarantees, we analyze its dynamic regret and show that it scales with the path length. The next theorem summarizes this main result.
Theorem 4.1.
The proof of the theorem is provided in Section 4.1 after some auxiliary results.
Lemma 4.2.
(Cost Difference Lemma [15]) For any two policies
where is the state at time achieved by applying the policy , is the input generated by the policy at time t, is the Q-function for policy and is the cost-to-go at time step , with initial state and control signal .
The proof is omitted, as it is identical to the one for Markov decision processes [15]. The following result for a general policy akin to the result in [16] follows.
Lemma 4.3.
Proof 4.4.
Let be the optimal Q-function, associated with the optimal control law . Then, using Lemma 4.2 the dynamic regret of the policy is given by
(16) |
i.e., a sum of differences of , evaluated at and , its minimizer. For an input, , and some ,
where the last equality follows from the closed form of as an extended quadratic function of [12, 17]. Thus, since minimizes an extended quadratic function, .
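For future reference, we also recall the Cauchy product inequality for two finite series $\{a_i\}_{i=0}^{T}$ and $\{b_i\}_{i=0}^{T}$:
$$\sum_{i=0}^{T} \left| \sum_{j=0}^{i} a_j b_{i-j} \right| \le \left( \sum_{i=0}^{T} |a_i| \right) \left( \sum_{j=0}^{T} |b_j| \right). \tag{17}$$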
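The inequality can be sanity-checked numerically. The sketch below assumes it takes the form $\sum_{i} |\sum_{j \le i} a_j b_{i-j}| \le (\sum_i |a_i|)(\sum_j |b_j|)$ and evaluates both sides on random finite sequences; the helper name is hypothetical.

```python
import random

def cauchy_product_sides(a, b):
    """Return (lhs, rhs) of the Cauchy product inequality for equal-length lists."""
    n = len(a)
    lhs = sum(abs(sum(a[j] * b[i - j] for j in range(i + 1))) for i in range(n))
    rhs = sum(abs(x) for x in a) * sum(abs(y) for y in b)
    return lhs, rhs

random.seed(0)
checks = []
for _ in range(100):
    n = random.randint(1, 10)
    a = [random.uniform(-1.0, 1.0) for _ in range(n)]
    b = [random.uniform(-1.0, 1.0) for _ in range(n)]
    lhs, rhs = cauchy_product_sides(a, b)
    checks.append(lhs <= rhs + 1e-12)

print(all(checks))
```

The bound holds because every product $a_j b_{i-j}$ on the left appears exactly once among the terms of the expanded product on the right.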
4.1 Proof of Theorem 4.1
As Lemma 4.3 suggests, the dynamic regret depends on the stepwise control input difference,
where and for all ,
We proceed by bounding each of the above terms separately.
Term : This captures the deviation of the artificial disturbance term from the one fixed at timestep . By noting that can be represented as a telescopic sum,
where and . Using (17)
Term : This captures the effect of truncating the infinite horizon problem to a finite one
where is defined in Assumption 1.
Term : This captures the cost of performing a gradient step in the direction of the steady state solution instead of the full solution, for a fixed . Note that is the solution of the following infinite horizon optimization problem and is independent of the initial state [12, 18]
which is equivalent to (15). Hence, by Theorem 3.3
where is the steady state of the SS-OGD dynamics (14) for a given . This term captures the difference between the SS-OGD update term and the steady state value for that timestep. We look at the evolution of the augmented state difference; for all
(18) | ||||
Then for a given time step . Under Assumption 3
Defining , , and
(19) | ||||
There exist , such that and from (19) for all . Using Lemma 4.3 and denoting
Note that, unlike the regret bound of the FOSS algorithm in [7], the constant multiplying the path length above does not depend on the reference bound from Assumption 1, but only on system parameters. This implies that the complexity term captures only the relative distance of the references and is not amplified by their upper bounds.
4.2 Steady State Benchmark
Given Theorem 3.3, one can also compare SS-OGD to the steady state optimal solution for each timestep. Consider
(20) |
for all , where and solve (15). This steady state controller can be interpreted as an optimal benchmark that is decoupled from the system dynamics, has access to the current cost , and hence to the one step ahead reference, , and solves for its optimal, steady state solution. The following Lemma provides a side result on the regret of the SS-OGD algorithm with respect to the steady state controller (20), .
Lemma 4.5.
5 Numerical Example
The SS-OGD algorithm is implemented on a linearized quadrotor model [19] in closed-loop with a PI velocity controller [20], to track a reference trajectory in two dimensions. In particular, we consider the following model
where the state contains the horizontal position, velocity, the roll and pitch angles of the quadrotor, and the input sets the target horizontal velocities. We take and .
In the first experiment, the drone tracks the shape of the letters IFA for an a priori unknown reference with a fixed for all timesteps. SS-OGD’s performance is compared to that of the causal CE controller that solves for the time-averaged infinite horizon steady state cost by fixing all future references to , i.e. . This is equivalent to solving (15) and fixing . The CE controller does not have access to , as opposed to the steady state benchmark in (20). The results in Figure 2 show that even though the CE controller appears to be tracking the reference better in the plot, the time plot reveals that it lags behind the reference trajectory, resulting in around times higher regret, compared to SS-OGD. As the reference signal has a constant rate of change, the double integrator dynamics of the open loop transfer function from the error to the state, allow SS-OGD to achieve perfect position tracking. When this is not the case SS-OGD again outperforms the CE controller.
In a second experiment the empirical worst-case regret as a function is calculated. For each , random reference signals are simulated and the highest value of regret is noted. The references are generated such that decreases with a constant factor of . This ensures a finite path length and therefore a finite regret in the limit, as shown in Figure 2, and in agreement with the upper bound in Theorem 4.1.
6 Conclusion
In this letter, we reformulate the online LQT problem as an online control problem subject to adversarial disturbances. Within this framework, we propose a novel online gradient descent-based algorithm, called SS-OGD, and show that its dynamic regret scales with the path length of the reference signal. We validate the results on numerical examples with a quadrotor model. Improving the regret coefficients and treating the case where the references are generated by some unknown dynamics are left for future work.
7 System-Optimizer Dynamics
The combined system-optimizer dynamics matrices are
where , and . The objective function in (15) can be equivalently written as for and as for , where
(21) |
Consider the coordinate transformation
with positive definite, as shown in Lemma 3.1. Using the small gain theorem for interconnected systems [14], the following, along with is a sufficient condition for the stability of and therefore of the dynamics (14)
Since is stable, there always exists an arbitrarily small such that the above is fulfilled.
8 Proof of Theorem 3.3
Given a disturbance vector and a bias input , the steady state of the dynamics (3) with the control law (8) is given by , where . Substituting this in the objective function of (15), one can confirm that the that minimizes that cost is the unique (as shown in Lemma 3.1) solution of . Using the definition of , the steady state of the SS-OGD update (13) for a constant solves . Then
Since , and , the two equations coincide, leading to the unique steady state solution .
References
- [1] E. Hazan, S. Kakade, and K. Singh, “The nonstochastic control problem,” in Algorithmic Learning Theory, pp. 408–421, PMLR, 2020.
- [2] D. Foster and M. Simchowitz, “Logarithmic regret for adversarial online control,” in International Conference on Machine Learning, pp. 3211–3221, PMLR, 2020.
- [3] Y. Abbasi-Yadkori, P. Bartlett, and V. Kanade, “Tracking adversarial targets,” in International Conference on Machine Learning, pp. 369–377, PMLR, 2014.
- [4] M. Nonhoff and M. A. Müller, “Online gradient descent for linear dynamical systems,” IFAC-PapersOnLine, vol. 53, no. 2, pp. 945–952, 2020.
- [5] M. Nonhoff, J. Köhler, and M. A. Müller, “Online convex optimization for constrained control of linear systems using a reference governor,” arXiv preprint arXiv:2211.09088, 2022.
- [6] E. C. Balta, A. Iannelli, R. S. Smith, and J. Lygeros, “Regret analysis of online gradient descent-based iterative learning control with model mismatch,” in 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 1479–1484, IEEE, 2022.
- [7] Y. Li, X. Chen, and N. Li, “Online optimal control with linear dynamics and predictions: Algorithms and regret analysis,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [8] A. Hauswirth, S. Bolognani, G. Hug, and F. Dörfler, “Optimization algorithms as robust feedback controllers,” arXiv preprint arXiv:2103.11329, 2021.
- [9] L. Cothren, A. M. Ospina, G. Bianchin, and E. Dall’Anese, “Online optimization of linear-time invariant dynamical systems with cost perception,” in 2022 56th Asilomar Conference on Signals, Systems, and Computers, pp. 1357–1361, IEEE, 2022.
- [10] A. Karapetyan, A. Tsiamis, E. C. Balta, A. Iannelli, and J. Lygeros, “Implications of regret on stability of linear dynamical systems,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 2583–2588, 2023.
- [11] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal control. John Wiley & Sons, 2012.
- [12] G. Goel and B. Hassibi, “The power of linear controllers in LQR control,” in 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 6652–6657, IEEE, 2022.
- [13] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” in Proceedings of the 20th international conference on machine learning (icml-03), pp. 928–936, 2003.
- [14] M. Green and D. J. Limebeer, Linear robust control. Courier Corporation, 2012.
- [15] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Proceedings of the Nineteenth International Conference on Machine Learning, pp. 267–274, 2002.
- [16] R. Zhang, Y. Li, and N. Li, “On the regret analysis of online LQR control with predictions,” in 2021 American Control Conference (ACC), pp. 697–703, IEEE, 2021.
- [17] A. Karapetyan, A. Iannelli, and J. Lygeros, “On the regret of control,” in 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 6181–6186, IEEE, 2022.
- [18] C. Yu, G. Shi, S.-J. Chung, Y. Yue, and A. Wierman, “The power of predictions in online control,” Advances in Neural Information Processing Systems, vol. 33, pp. 1994–2004, 2020.
- [19] P. N. Beuchat, “N-rotor vehicles: modelling, control, and estimation,” ETH Zurich Research Collection, 2019.
- [20] A. Karapetyan, “Distributed Control of Flying Quadrotors.” https://github.com/akarapet/admm_collision_avoidance, June 2020.