Online Linear Quadratic Tracking with
Regret Guarantees

Aren Karapetyan, Diego Bolliger, Anastasios Tsiamis, Efe C. Balta, and John Lygeros

This work has been supported by the Swiss National Science Foundation under NCCR Automation (grant agreement 51NF40_180545), and by the European Research Council under the ERC Advanced grant agreement 787845 (OCAL). Aren Karapetyan and Diego Bolliger contributed equally to this work. Aren Karapetyan, Anastasios Tsiamis, and John Lygeros are with the Automatic Control Laboratory, Department of Information Technology and Electrical Engineering, ETH Zürich, 8092 Zürich, Switzerland (e-mail: {akarapetyan, atsiamis, jlygeros}@control.ee.ethz.ch). Diego Bolliger is with the School of Engineering, ZHAW Zurich University of Applied Sciences, 8400 Winterthur, Switzerland (e-mail: [email protected]). Efe C. Balta is with the Control and Automation Group, Inspire AG, 8005 Zürich, Switzerland (e-mail: [email protected]).

Abstract

Online learning algorithms for dynamical systems provide finite time guarantees for control in the presence of sequentially revealed cost functions. We pose the classical linear quadratic tracking problem in the framework of online optimization where the time-varying reference state is unknown a priori and is revealed after the applied control input. We show the equivalence of this problem to the control of linear systems subject to adversarial disturbances and propose a novel online gradient descent-based algorithm to achieve efficient tracking in finite time. We provide a dynamic regret upper bound scaling linearly with the path length of the reference trajectory and a numerical example to corroborate the theoretical guarantees.

Index Terms

Optimal Tracking, Online Control.

1 Introduction


Linear quadratic tracking (LQT) is the natural generalization of the optimal linear quadratic regulator (LQR) to the setting where the goal is to drive the state not to the origin but to a certain reference. The reference trajectory need not be time-invariant and, in the classic formulation of the problem, is known in advance. This is a reasonable assumption in many practical applications, such as aircraft tracking of a predetermined trajectory or precision control in industrial process engineering. However, in other scenarios, for example, in tracking the output of a secondary agent whose dynamics are unknown and/or whose measurements are imperfect, the prediction of the next reference point is non-trivial. In these cases the reference trajectory is only revealed sequentially, after the action has been taken, suggesting the need for an online or adaptive algorithm that will learn or adapt to the dynamics of the reference-generating agent.

In this letter, we study the LQT problem with an unknown reference trajectory. We pose the problem in the framework of online convex optimization (OCO) subject to the dynamics constraint of the system [1, 2]. In particular, the tracking problem is recast into an equivalent regulation problem with a redefined state that evolves with linear dynamics subject to additive adversarial disturbances. In the spirit of online decision-making under computational and memory constraints, our goal is to develop a gradient-based algorithm that is fast and simple to implement and requires no large memory. To this end, we show how classical online gradient descent (OGD) may fail to achieve optimal tracking and propose a modified algorithm, called SS-OGD (steady state OGD) that is guaranteed to achieve the goal under mild conditions. Given the online nature of the algorithm, its performance is quantified through the means of dynamic regret that compares the accumulated finite time cost of a given algorithm to that of an optimal benchmark that solves the LQT problem with an a priori knowledge of the reference trajectory. We provide a dynamic regret bound that scales linearly with the path length of the reference trajectory.

The LQT problem for sequentially revealed adversarial reference states is studied mostly with policy regret guarantees, with one of the first works [3] suggesting a relatively computationally heavy algorithm. In a more recent line of work [4] the authors introduce a memory-based, gradient descent algorithm and in [5] tackle the constrained tracking problem. Several works also provide dynamic regret guarantees for tracking of unknown targets, however, their settings differ from ours. In [6], the authors analyze an output tracking scheme but assume an iterative setting, while in [7] a window of predictions is available. Without predictions, their regret order is determined by that of a fixed oracle controller. In [7], the authors also provide a lower bound for the dynamic regret in terms of the reference path length, matching the same order as our proposed scheme. Gradient-based algorithms, as the ones we study, have also been developed in the context of online feedback optimization [8]. There, in contrast to our setting, the dynamics are generally assumed to be unknown, but, crucially, the cost functions are fixed over the horizon, and the regret is not analyzed. Several recent works, e.g. [9], consider a similar setting with time-varying costs. These, however, are allowed to be estimated offline by training, incompatible with our setting, and without regret guarantees.

Notation: The set of positive real numbers is denoted by $\mathbb{R}_{+}$ and that of non-negative integers by $\mathbb{N}$. For a matrix $W$, the spectral radius and the spectral norm are denoted by $\rho(W)$ and $\|W\|$, respectively, and $\lambda_{min}(W)$ denotes its minimum eigenvalue. We define $\lambda_{W}:=\frac{1+\rho(W)}{2}$; one can show that if $\rho(W)<1$, there exists a $c_{W}\in\mathbb{R}_{+}$ such that $\|W^{k}\|\leq c_{W}{\lambda_{W}}^{k}$ for all $k\geq 1$. For a given vector $x$, its Euclidean norm is denoted by $\|x\|$, and the one weighted by some matrix $Q$ by $\|x\|_{Q}=\sqrt{x^{\top}Qx}$.

2 Problem Statement

Consider the discrete-time linear time-invariant (LTI) dynamical system, given by

x_{t+1}=Ax_{t}+Bu_{t},\quad\forall t\in\mathbb{N},

(1)

where $x_{t}\in\mathbb{R}^{n}$ and $u_{t}\in\mathbb{R}^{m}$ are the state and input vectors respectively, and $A\in\mathbb{R}^{n\times n},B\in\mathbb{R}^{n\times m}$ are known system matrices. The goal of the optimal LQT problem is the tracking of a time-varying signal $r_{t}\in\mathbb{R}^{n}$ , such that the cost

\|x_{T}-r_{T}\|_{P}^{2}+\sum_{t=0}^{T-1}\|x_{t}-r_{t}\|_{Q}^{2}+\|u_{t}\|_{R}^{2}

is minimized for some weighting matrices $Q\in\mathbb{R}^{n\times n}$ and $R\in\mathbb{R}^{m\times m}$. The terminal cost matrix is taken to be $P$ for simplicity (for other values of the terminal cost matrix the results still hold up to an additional constant [2]), where $P\in\mathbb{R}^{n\times n}$ is the solution of the discrete algebraic Riccati equation (DARE)

P=Q+A^{\top}PA-A^{\top}PB(R+B^{\top}PB)^{-1}B^{\top}PA.

(2)
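As a minimal sketch of how $P$ and the associated gain can be obtained numerically, the Riccati recursion can simply be iterated until it settles at a fixed point of (2). The double-integrator matrices, the iteration count, and the helper name `solve_dare` below are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def solve_dare(A, B, Q, R, iters=2000):
    """Iterate the Riccati recursion; its fixed point satisfies the DARE (2)."""
    P = Q.copy()
    for _ in range(iters):
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ G
    return P

# Hypothetical double-integrator example (not the quadrotor model of the paper).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

P = solve_dare(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # gain (R + B'PB)^{-1} B'PA
residual = Q + A.T @ P @ A - A.T @ P @ B @ K - P        # DARE residual, ideally ~0
rho = max(abs(np.linalg.eigvals(A - B @ K)))            # closed-loop spectral radius
```

A plain fixed-point iteration is used here only to keep the sketch dependency-free; a dedicated Riccati solver would normally be preferred.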

The LQT problem can be recast into an equivalent LQR formulation [10] by considering instead the dynamics

e_{t+1}=Ae_{t}+Bu_{t}+w_{t},\quad\forall t\in\mathbb{N},

(3)

with $e_{t}:=x_{t}-r_{t}$ and $w_{t}:=Ar_{t}-r_{t+1}$ for all $t\in\mathbb{N}$ , and the corresponding cost function

J(e_{0},u):=\|e_{T}\|_{P}^{2}+\sum_{t=0}^{T-1}\|e_{t}\|_{Q}^{2}+\|u_{t}\|_{R}^{2}.

(4)
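The change of variables can be checked numerically. The sketch below simulates (1) for arbitrary inputs and an arbitrary reference, forms $e_{t}$ and $w_{t}$ as defined above, and confirms that the error obeys (3). The system matrices, horizon, and random seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 2, 1, 20
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # hypothetical system matrices
B = np.array([[0.005], [0.1]])

r = rng.normal(size=(T + 1, n))           # arbitrary reference r_0, ..., r_T
u = rng.normal(size=(T, m))               # arbitrary inputs u_0, ..., u_{T-1}
x = np.zeros((T + 1, n))
for t in range(T):
    x[t + 1] = A @ x[t] + B @ u[t]        # original dynamics (1)

e = x - r                                 # tracking error e_t = x_t - r_t
w = r[:-1] @ A.T - r[1:]                  # artificial disturbance w_t = A r_t - r_{t+1}
e_pred = e[:-1] @ A.T + u @ B.T + w       # right-hand side of (3), row by row
max_gap = np.max(np.abs(e[1:] - e_pred))  # mismatch between (1) and (3)
```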

When the reference trajectory $r_{t}$, $t\in\mathbb{N}$, is known at the initial time, a closed-form solution can be obtained for the optimal controller that solves the optimization problem

u^{\star}=\operatorname*{arg\,min}_{u}\quad J(e_{0},u)

(5a)

\text{subject to}\quad\text{(3)}\quad\forall~0\leq t<T.

(5b)

This controller, often referred to as the optimal offline noncausal controller, can be represented as a linear feedback on the current state and the future reference [11, 12].

Departing from the classical formulation of tracking control, we assume that the reference signal is unknown and is only revealed sequentially after the control input has been applied, similar to the adversarial tracking framework in [3]. In particular, for each time step $0\leq t<T$ :

1: The state $x_{t}$ and the reference state $r_{t}$ are observed,

2: The agent decides on an input $u_{t}$,

3: The environment decides on the next reference $r_{t+1}$, which, in turn, determines $w_{t}$.

The error state then evolves according to (3), incurring the following cost for the agent

c_{t}(e_{t},u_{t}):=\|Ae_{t}+Bu_{t}+w_{t}\|_{Q}^{2}+\|u_{t}\|^{2}_{R}.

(6)

Note that the online cost (6) depends on the current input $u_{t}$ and the unknown disturbance $w_{t}$, and is therefore unknown to the decision maker at timestep $t$; it is revealed only at time $t+1$, after the input $u_{t}$ has been applied to the system. Our problem formulation fits the online learning framework, with the extra challenge of inherent dynamics. For consistency, we require $c_{T-1}(e_{T-1},u_{T-1}):=\|Ae_{T-1}+Bu_{T-1}+w_{T-1}\|_{P}^{2}+\|u_{T-1}\|^{2}_{R}$; to avoid cluttering the notation, this separate treatment of the last timestep is left implicit. The goal of the controller is then to minimize the online cumulative cost

\sum_{t=0}^{T-1}c_{t}(e_{t},u_{t})=J(e_{0},u)-\|e_{0}\|^{2}_{Q}.

This is the same as the LQR cost without the initial state, implying that the minimizers of both problems coincide.
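The identity between the cumulative online cost and the LQR cost can be verified on a random trajectory, as in the following sketch. The system, the weights, and in particular the stand-in terminal matrix `P` are assumptions: the identity is purely algebraic and holds for any terminal weight once the last stage cost uses it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 2, 1, 15
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # hypothetical system
B = np.array([[0.005], [0.1]])
Q, R = np.eye(n), 0.1 * np.eye(m)
P = 2.0 * np.eye(n)                       # stand-in terminal weight (not the DARE solution)

def wnorm2(z, W):
    """Squared weighted norm z' W z."""
    return float(z @ W @ z)

e = np.zeros((T + 1, n))
e[0] = rng.normal(size=n)
u = rng.normal(size=(T, m))
w = rng.normal(size=(T, n))
online = 0.0
for t in range(T):
    e[t + 1] = A @ e[t] + B @ u[t] + w[t]           # error dynamics (3)
    W_next = P if t == T - 1 else Q                 # last stage uses the terminal weight
    online += wnorm2(e[t + 1], W_next) + wnorm2(u[t], R)   # stage cost (6)

J = wnorm2(e[T], P) + sum(wnorm2(e[t], Q) + wnorm2(u[t], R) for t in range(T))
gap = abs(online - (J - wnorm2(e[0], Q)))           # should vanish up to rounding
```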

We quantify the finite-time performance of the algorithm through the means of dynamic regret. Consider a policy $\pi:\mathcal{I}\rightarrow\mathbb{R}^{m}$ , mapping from the available information set, $\mathcal{I}$ , to the control input space. Its dynamic regret, given a disturbance signal $w$ , is defined as

\mathcal{R}^{\pi}(w,e_{0})=J(e_{0},u^{\pi})-J(e_{0},u^{\star}),

(7)

where $u^{\pi}$ is the input generated by $\pi$ and $u^{\star}$ is given by (5).

We allow the trajectory $r_{t},\;t\in\mathbb{N}$ to be arbitrary, as long as it remains bounded.

Assumption 1 (Bounded trajectory)

There exists a $\bar{R}\in\mathbb{R}_{+}$ , such that $\|r_{t}\|\leq\bar{R}$ for all $t\in\mathbb{N}$ .

The more abruptly a trajectory changes, the harder it is to achieve good tracking performance, especially if the trajectory is unknown beforehand. To capture this inherent complexity of the problem with dynamic regret, we use the well-established notion of path length [7], [13].

Definition 2.1 (Path Length)

The path length of a reference trajectory $r_{0:T}\in\mathbb{R}^{n(T+1)}$ is $L(T)=\sum_{t=0}^{T-1}\|\Delta r_{t}\|$ , where $\Delta r_{t}=r_{t+1}-r_{t}$ .
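For concreteness, the path length of a sampled reference can be computed as below. The row-wise storage convention and the two small test trajectories are assumptions made for this illustration.

```python
import numpy as np

def path_length(r):
    """L(T) = sum_t ||r_{t+1} - r_t|| for a reference stored row-wise, shape (T+1, n)."""
    return float(np.sum(np.linalg.norm(np.diff(r, axis=0), axis=1)))

r_const = np.tile([1.0, 2.0], (5, 1))                      # constant reference
r_line = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])    # two steps of Euclidean length 5

L_const = path_length(r_const)
L_line = path_length(r_line)
```

A constant reference has zero path length, while the second trajectory accumulates the length of each of its two steps.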

For more random and abrupt changes in the trajectory, the path length is higher, and one expects the performance of an online algorithm to deteriorate. Likewise, an efficient algorithm should improve as the path length decreases. This is captured quantitatively by showing at least a linear dependence of the algorithm’s regret on the path length. One can instead choose the complexity term to be the path length of the artificial disturbances $w_{0:T-1}\in\mathbb{R}^{nT}$. The resulting path length will then decrease the closer the reference dynamics are to the given system in a certain operator norm, but will scale linearly with time in the case of a constant mismatch between the two. Since we only assume bounded references, we allow for potentially random references without any underlying dynamics. Hence, we choose $L(T)$ as our complexity term. Under the following standard assumptions on stabilisability and detectability [14], the LQR problem is well-posed.

Assumption 2 (LQR is well-posed)

The system $(A,B)$ is stabilisable, the pair $(Q^{\frac{1}{2}},A)$ is detectable and $R\succ 0$ .

3 The SS-OGD Algorithm

We consider a control law of the following form

u_{t}=-Ke_{t}+v_{t},\quad\forall 0\leq t<T,

(8)

where $K=(R+B^{\top}PB)^{-1}B^{\top}PA$ is fixed to the optimal LQR gain, and $v_{t}$ is a correction term that should account for the unknown disturbances; we will employ online learning techniques to update the latter term.

We investigate the performance of online gradient descent based algorithms. Consider the following “naive” update

v_{t}=v_{t-1}-\alpha\nabla_{v}c_{t-1}(e_{t-1},u_{t-1}),

(9)

where $v_{t}$ is updated in the opposite direction of the gradient of the most recent cost. Here $\alpha\in\mathbb{R}_{+}$ is the step size and the recursion starts from some $v_{0}\in\mathbb{R}^{m}$. As the online objective is quadratic, the gradient is available in closed form and the update can be written as $v_{t}=v_{t-1}-2\alpha(Ru_{t-1}+B^{\top}Qe_{t})$. For the case of a constant reference signal and an underactuated system, the algorithm can converge to a point that is not necessarily optimal with respect to infinite horizon cost minimization. This is due to the greedy behavior of the update, which does not take future dynamics into account. In this section, we propose a simple modification to this myopic OGD update (9), called SS-OGD, that accounts for this shortcoming.

To motivate the SS-OGD update, we consider the steady state solution of (3) in closed-loop with the affine control law (8) when we fix $v_{i}=\bar{v}$ and $r_{i}=\bar{r}$ for all subsequent timesteps $i\geq t$. Defining $S:=(I-A+BK)^{-1}B$, a closed form solution for the steady state and input is given by (note that $\bar{x}$ and $\bar{u}$ are both defined for a given $\bar{v}$ and $\bar{r}$; this dependence is omitted from the notation for simplicity)

\bar{x}=S\bar{v}+SK\bar{r},\qquad\bar{u}=(I-KS)(\bar{v}+K\bar{r}).

(10)
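The steady-state expressions in (10) can be sanity-checked by confirming that $\bar{x}$ is invariant under (1) when $\bar{u}$ is applied and that $\bar{u}$ agrees with the control law (8). In the sketch below the system and the gain `K` are hypothetical; any gain for which $I-A+BK$ is invertible suffices for this check, so the LQR gain is not computed.

```python
import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.8]])   # hypothetical system
B = np.array([[0.0], [1.0]])
K = np.array([[0.2, 0.3]])                # assumed gain, only needs I - A + BK invertible
n, m = B.shape

S = np.linalg.solve(np.eye(n) - A + B @ K, B)          # S = (I - A + BK)^{-1} B
v_bar = np.array([0.5])                                 # fixed correction term
r_bar = np.array([1.0, -1.0])                           # fixed reference

x_bar = S @ v_bar + S @ K @ r_bar                       # steady state from (10)
u_bar = (np.eye(m) - K @ S) @ (v_bar + K @ r_bar)       # steady input from (10)

# x_bar must be a fixed point of (1) under u_bar, and u_bar must obey (8).
dyn_gap = np.max(np.abs(A @ x_bar + B @ u_bar - x_bar))
law_gap = np.max(np.abs(-K @ (x_bar - r_bar) + v_bar - u_bar))
```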

One can then find the $\bar{v}$ that recovers the optimal steady state solution by minimizing the time-averaged infinite horizon steady state cost. For $\bar{x}$ and $\bar{u}$ defined as in (10), this amounts to solving

\operatorname*{arg\,min}_{\bar{v}}\left\{c\left(\bar{x}-\bar{r},\bar{u}\right):=\|\bar{x}-\bar{r}\|_{Q}^{2}+\|\bar{u}\|_{R}^{2}\right\},

(11)

whose gradient is given by

\nabla_{\bar{v}}c\left(\bar{x}-\bar{r},\bar{u}\right)=2\left((I-KS)^{\top}R\bar{u}+S^{\top}Q(\bar{x}-\bar{r})\right).

(12)

Since $r$ is, in general, not constant, and the steady state condition is not satisfied, we suggest a new OGD-like update rule on the bias term $v_{t}$ that is a modified version of the gradient in (12). Specifically, the feedback on the steady state error, $\bar{x}-\bar{r}$, is replaced with the measured error, $x_{t}-r_{t}$, and the steady state input, $\bar{u}$, with the latest applied input, $u_{t-1}$. This results in the following update, named SS-OGD

v_{t}=v_{t-1}-2\alpha\left((I-KS)^{\top}Ru_{t-1}+S^{\top}Qe_{t}\right).

(13)
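To make the update concrete, the sketch below implements one SS-OGD step and runs it in closed loop on a scalar toy system with a constant reference. The system ($A=0.9$, $B=Q=R=1$), the step size $\alpha=0.1$, and the function name `ss_ogd_step` are assumptions chosen so that the limit can be verified by hand; they are not the paper's experiment.

```python
import numpy as np

def ss_ogd_step(v_prev, u_prev, e_now, alpha, K, S, Q, R):
    """One SS-OGD update (13): v_t from v_{t-1}, u_{t-1} and the measured error e_t."""
    m = v_prev.shape[0]
    grad = 2.0 * ((np.eye(m) - K @ S).T @ R @ u_prev + S.T @ Q @ e_now)
    return v_prev - alpha * grad

# Scalar toy system (hypothetical).
A, B = np.array([[0.9]]), np.array([[1.0]])
Q, R = np.array([[1.0]]), np.array([[1.0]])
P = Q.copy()
for _ in range(500):                        # Riccati recursion for the DARE (2)
    P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
S = np.linalg.solve(np.eye(1) - A + B @ K, B)

alpha, r_bar = 0.1, np.array([1.0])         # assumed step size and constant reference
x, v = np.array([0.0]), np.array([0.0])
for t in range(300):
    u = -K @ (x - r_bar) + v                # control law (8)
    x = A @ x + B @ u                       # plant update (1)
    v = ss_ogd_step(v, u, x - r_bar, alpha, K, S, Q, R)

e_ss = x - r_bar                            # steady-state error reached by SS-OGD
u_ss = -K @ (x - r_bar) + v                 # steady-state input reached by SS-OGD
```

For this toy system the minimizer of $\|e\|_{Q}^{2}+\|u\|_{R}^{2}$ over steady states satisfies $0.1u+e=0$ together with $0.1e=u-0.1$, which gives $u=0.1/1.01$ and $e=-0.01/1.01$; the simulated values approach this point.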

The cost ${c}$ in (12) is defined for the steady state $\bar{x}$ and input $\bar{u}$ and is thus decoupled from the true online cost $c_{t}$ in (6) that reflects the current $e_{t}$ and $u_{t}$ . These are, in general, not at a steady state and $r_{t}$ is not constant. Thus, ${c}$ is only an auxiliary, hallucinated cost to construct the update (13).

Lemma 3.1

Under Assumption 2, (11) is strictly convex in $\bar{v}$ for any $K\in\mathbb{R}^{m\times n}$ , for which $\rho(A-BK)<1$ .

Proof 3.2.

If the matrix $I-KS$ is singular, there exists a nonzero $v\in\mathbb{R}^{m}$ such that $v=KSv$. Then, for $x=Sv$, at steady state $x=Ax+B(KSv-Kx)=Ax$. Given the detectability condition of the pair $(Q^{\frac{1}{2}},A)$, for any unstable or marginally stable mode of $A$, the matrix $Q\succ 0$. This ensures that the matrix $S^{\top}QS+(I-KS)^{\top}R(I-KS)$ is positive definite, which is equivalent to the strong convexity of (11).

The modifications from the standard OGD can be interpreted as incorporating the dynamics information in the update rule. As we show in the following, this ensures that in the limit, if the algorithm is stable and the reference signal is constant, the SS-OGD converges to the same point as the solution of the LQR problem minimizing (4). Moreover, through the feedback on the state $e_{t}$ and input $u_{t-1}$ , the update rule (13) incorporates a proportional integral (PI) control on the measured state. This is demonstrated on a quadrotor control example in Section 5, where, with the inherent integrator dynamics of the quadrotor, the SS-OGD achieves a zero steady state error in tracking a position reference signal with a constant rate of change.

To study the SS-OGD update rule, we introduce the following evolution of the combined system optimizer dynamics

z_{t+1}=\tilde{A}z_{t}+\tilde{B}w_{t},

(14)

where $z_{t}:=[v_{t}^{\top}~{}e_{t}^{\top}]^{\top}$ , the matrices $\tilde{A}\in\mathbb{R}^{p\times p}$ and $\tilde{B}\in\mathbb{R}^{p\times n}$ are defined in Appendix 7 and $p:=m+n$ .

Assumption 3

The step size $\alpha>0$ is such that $\rho(\tilde{A})<1$ .

Since all the variables in $\tilde{A}$ are known a priori, we show that there always exists an $\alpha$ satisfying this assumption and provide a sufficient condition in Appendix 7.

The following theorem shows that, for a constant $w_{t}=\bar{w}$ for all $0\leq t<T$ , SS-OGD update (13) converges to the solution of

\begin{split}(\hat{e}_{t},\hat{v}_{t})=~&\operatorname*{arg\,min}_{(e,v)}\quad\|e\|_{Q}^{2}+\left\|-Ke+v\right\|_{R}^{2}\\ &\text{subject to}\quad e=(A-BK)e+Bv+\underbrace{Ar_{t}-r_{t+1}}_{w_{t}},\end{split}

(15)

with $\textstyle{r_{T+1}:=r_{T}}$ . The solution of (15) can be interpreted as the steady state and steady state input that minimize the infinite horizon time-averaged cost (4).

Theorem 3.3.

Under Assumptions 2 and 3, if $w_{t}=\bar{w}$ for all $t\in\mathbb{N}$ , the steady state of (14) coincides with the solution of (15).

The proof of the theorem is provided in Appendix 8. As a corollary, for a constant signal $r_{t}=\bar{r}$, the update converges to the solution of (11). Note that this is not always true for the naive OGD update (9), as its fixed point for a fixed disturbance is not necessarily the same as the solution of (15).
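The statement of Theorem 3.3 can be illustrated numerically: the sketch below builds $\tilde{A}$ and $\tilde{B}$ as defined in Appendix 7, computes the fixed point of (14) for a constant disturbance, and compares it with a direct solution of (15) obtained by eliminating $e$ through the constraint. The double-integrator system, the disturbance value, and the step size are assumptions; the fixed point itself does not depend on $\alpha$.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # hypothetical double integrator
B = np.array([[0.005], [0.1]])
Q, R = np.eye(2), np.array([[0.1]])
n, m = B.shape

P = Q.copy()
for _ in range(2000):                      # Riccati recursion for the DARE (2)
    P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
F = A - B @ K
S = np.linalg.solve(np.eye(n) - F, B)
IKS = np.eye(m) - K @ S

alpha = 0.01                               # assumed step size
M = 2 * (S.T @ Q @ B + IKS.T @ R)
H = 2 * (S.T @ Q @ F - IKS.T @ R @ K)
A_til = np.block([[np.eye(m) - alpha * M, -alpha * H], [B, F]])
B_til = np.vstack([-2 * alpha * S.T @ Q, np.eye(n)])

w_bar = np.array([0.3, -0.2])              # constant disturbance (assumed value)
z_hat = np.linalg.solve(np.eye(m + n) - A_til, B_til @ w_bar)   # fixed point of (14)

# Direct solution of (15): substitute e = S v + G w, then minimize the quadratic in v.
G = np.linalg.inv(np.eye(n) - F)
M_bar = 2 * (S.T @ Q @ S + IKS.T @ R @ IKS)
v_opt = -np.linalg.solve(M_bar, 2 * (S.T @ Q @ G - IKS.T @ R @ K @ G) @ w_bar)
e_opt = S @ v_opt + G @ w_bar
```

The first $m$ entries of `z_hat` correspond to $\hat{v}$ and the remaining $n$ entries to $\hat{e}$, following the ordering of $z_{t}$.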

4 Regret Analysis

To characterize the effectiveness of the algorithm for time-varying signals and to provide finite time guarantees, we analyze its dynamic regret and show that it scales with the path length. The next theorem summarizes this main result.

Theorem 4.1.

Under Assumptions 1, 2 and 3, the dynamic regret of the SS-OGD algorithm scales with the path length

\mathcal{R}^{\mathrm{SS-OGD}}(w,e_{0})\leq\mathcal{O}\left(1+L(T)\right).

The proof of the theorem is provided in Section 4.1 after some auxiliary results.

Lemma 4.2.

(Cost Difference Lemma [15]) For any two policies $\pi_{1},\pi_{2}$

J(e_{0},u^{\pi_{2}})-J(e_{0},u^{\pi_{1}})=\sum_{t=0}^{T-1}\mathcal{Q}_{t}^{\pi_{1}}(e_{t}^{\pi_{2}},u_{t}^{\pi_{2}})-J_{t}(e_{t}^{\pi_{2}},u^{\pi_{1}}),

where $e_{t}^{\pi_{2}}$ is the state at time $t$ achieved by applying the policy $\pi_{2}$, $u_{t}^{\pi_{2}}$ is the input generated by the policy $\pi_{2}$ at time $t$, $\mathcal{Q}_{t}^{\pi_{1}}(e,u)=\|e\|_{Q}^{2}+\|u\|_{R}^{2}+J_{t+1}(Ae+Bu+w_{t},u^{\pi_{1}})$ is the Q-function for policy $\pi_{1}$, and $J_{i}(e_{i},u)$ is the cost-to-go at time step $i$, with initial state $e_{i}$ and control signal $u$.

The proof is omitted, as it is identical to the one for Markov decision processes [15]. The following result for a general policy $\pi$ akin to the result in [16] follows.

Lemma 4.3.

Under Assumption 2, given the system dynamics (3) and cost function (4), the dynamic regret of any policy $\pi$ is given by

\mathcal{R}^{\pi}(w,e_{0})=\sum_{t=0}^{T-1}\left(u_{t}^{\pi}-u_{t}^{\star}\right)^{\top}\left(R+B^{\top}PB\right)\left(u_{t}^{\pi}-u_{t}^{\star}\right),

where $u_{t}^{\pi}$ and $u_{t}^{\star}$ denote the inputs generated by $\pi$ , and the optimal policy, both evaluated at the policy state $e_{t}^{\pi}$ at time $t$ .

Proof 4.4.

Let $\mathcal{Q}^{\star}_{t}(e,u)$ be the optimal Q-function, associated with the optimal control law $u_{t}^{\star}$. Then, using Lemma 4.2, the dynamic regret of the policy $\pi$ is given by

\mathcal{R}^{\pi}(w,e_{0})=\sum_{t=0}^{T-1}\mathcal{Q}_{t}^{\star}(e_{t}^{\pi},u_{t}^{\pi})-\min_{u}\mathcal{Q}_{t}^{\star}(e_{t}^{\pi},u),

(16)

i.e., a sum of differences of $\mathcal{Q}^{\star}_{t}$ , evaluated at $u_{t}^{\pi}$ and $u^{\star}$ , its minimizer. For an input, $u\in\mathbb{R}^{m}$ , and some $f\in\mathbb{R}^{m}$ , $g\in\mathbb{R}$

\begin{split}\mathcal{Q}^{\star}_{t}(e,u)&=\|e\|_{Q}^{2}+\|u\|_{R}^{2}+J_{t+1}(Ae+Bu+w_{t},u^{\star})\\ &=u^{\top}(R+B^{\top}PB)u+f^{\top}u+g,\end{split}

where the last equality follows from the closed form of $J_{t+1}(x,u^{\star})$ as an extended quadratic function of $x$ [12, 17]. Thus, since $u_{t}^{\star}$ minimizes an extended quadratic function, $\mathcal{Q}_{t}^{\star}(e_{t}^{\pi},u_{t}^{\pi})-\mathcal{Q}_{t}^{\star}(e_{t}^{\pi},u_{t}^{\star})=\|u_{t}^{\pi}-u_{t}^{\star}\|^{2}_{\left(R+B^{\top}PB\right)}$.

For future reference, we also recall the Cauchy product inequality for two finite sequences $\{a_{i}\}_{i=0}^{T}$ and $\{b_{i}\}_{i=0}^{T}$:

\textstyle{\sum_{i=0}^{T}\left|\sum_{j=0}^{i}a_{j}b_{i-j}\right|\leq\left(\sum_{i=0}^{T}|a_{i}|\right)\left(\sum_{j=0}^{T}|b_{j}|\right)}.

(17)
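A quick numerical check of (17) is sketched below for two random sequences; the sequence length and seed are arbitrary assumptions. The inner sums on the left-hand side are exactly the first $T+1$ terms of the discrete convolution of the two sequences.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 30
a = rng.normal(size=T + 1)                 # a_0, ..., a_T
b = rng.normal(size=T + 1)                 # b_0, ..., b_T

# conv[i] = sum_{j=0}^{i} a[j] * b[i-j] for i = 0, ..., T (first T+1 convolution terms).
conv = np.convolve(a, b)[: T + 1]
lhs = np.sum(np.abs(conv))                 # left-hand side of (17)
rhs = np.sum(np.abs(a)) * np.sum(np.abs(b))  # right-hand side of (17)
```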

4.1 Proof of Theorem 4.1

As Lemma 4.3 suggests, the dynamic regret depends on the stepwise control input difference,

\begin{split}\left\|u^{\mathrm{ogd}}_{t}-u_{t}^{\star}\right\|&=\left\|-Ke_{t}+v_{t}+Ke_{t}+\sum_{i=t}^{T-1}K_{w}^{i,t}w_{i}\right\|\\ &\leq\underbrace{\left\|v_{t}+\sum_{i=t}^{\infty}K_{w}^{i,t}w_{t}\right\|}_{s_{1,t}}+\underbrace{\left\|\sum_{i=t}^{T-1}K_{w}^{i,t}\Delta w_{i,t}\right\|}_{s_{2,t}}+\underbrace{\left\|\sum_{i=T}^{\infty}K_{w}^{i,t}w_{t}\right\|}_{s_{3,t}},\end{split}

where $\Delta w_{i,t}=w_{i}-w_{t}$ and for all $0\leq t\leq i<T$ ,

K_{w}^{i,t}=(R+B^{\top}PB)^{-1}B^{\top}\left((A-BK)^{\top}\right)^{i-t}P.

We proceed by bounding each of the above terms separately.

Term $\boldsymbol{s_{2,t}}$ : This captures the deviation of the artificial disturbance term from the one fixed at timestep $t$ . By noting that $\Delta w_{i,t}$ can be represented as a telescopic sum,

\begin{split}s_{2,t}&\leq c_{F}d\sum_{i=t}^{T-1}\lambda_{F}^{i-t}\sum_{j=t}^{i-1}\|w_{j+1}-w_{j}\|\\ &\leq\frac{c_{F}d}{1-\lambda_{F}}\sum_{j=t}^{T-2}\|w_{j+1}-w_{j}\|\lambda_{F}^{j-t},\end{split}

where $F:=A-BK$ and $d=\|(R+B^{\top}PB)^{-1}B^{\top}\|\cdot\|P\|$ . Using (17)

\begin{split}\sum_{t=0}^{T-1}s_{2,t}&\leq\frac{c_{F}d}{1-\lambda_{F}}\sum_{t=1}^{T-1}\sum_{j=t}^{T-1}\|w_{j}-w_{j-1}\|\lambda_{F}^{j-t}\\ &\leq\frac{c_{F}d}{1-\lambda_{F}}\sum_{j=1}^{T-1}\sum_{t=1}^{j}\|w_{j}-w_{j-1}\|\lambda_{F}^{j-t}\\ &\leq\frac{c_{F}d}{(1-\lambda_{F})^{2}}\sum_{j=1}^{T-1}\|w_{j}-w_{j-1}\|\leq\frac{c_{F}d\left(\|A\|+1\right)}{(1-\lambda_{F})^{2}}\cdot L(T).\end{split}

Term $\boldsymbol{s_{3,t}}$ : This captures the effect of truncating the infinite horizon problem to a finite one

\begin{split}s_{3,t}&\leq c_{F}d\left(\|A\|+1\right)\bar{R}\sum_{i=T}^{\infty}\lambda_{F}^{i-t}\leq\frac{c_{F}d\left(\|A\|+1\right)\bar{R}\lambda_{F}^{T-t}}{1-\lambda_{F}},\\ \sum_{t=0}^{T-1}s_{3,t}&\leq\frac{c_{F}d\left(\|A\|+1\right)\bar{R}\left(1-\lambda_{F}^{T}\right)}{(1-\lambda_{F})^{2}},\end{split}

where $\bar{R}$ is defined in Assumption 1.

Term $\boldsymbol{s_{1,t}}$ : This captures the cost of performing a gradient step in the direction of the steady state solution instead of the full solution, for a fixed $w_{t}$ . Note that $-\sum_{i=t}^{\infty}K_{w}^{i,t}w_{t}$ is the solution of the following infinite horizon optimization problem and is independent of the initial state [12, 18]

\begin{split}\hat{v}_{t}&=\operatorname*{arg\,min}_{v}\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{i=0}^{T}\|e\|_{Q}^{2}+\|-Ke+v\|_{R}^{2}\\ &\quad\text{subject to}\quad e=(A-BK)e+Bv+w_{t},\end{split}

which is equivalent to (15). Hence, by Theorem 3.3

\sum_{i=t}^{\infty}K_{w}^{i,t}w_{t}=-\left[I\;0\right](I-\tilde{A})^{-1}\tilde{B}w_{t}=-\left[I\;0\right]\hat{z}_{t},

where $\hat{z}_{t}=[\hat{v}^{\top}_{t}~\hat{e}^{\top}_{t}]^{\top}:=(I-\tilde{A})^{-1}\tilde{B}w_{t}$ is the steady state of the SS-OGD dynamics (14) for a given $w_{t}$. This term captures the difference between the SS-OGD update term $v_{t}$ and the steady state value $\hat{v}_{t}$ for that timestep. We look at the evolution of the augmented state difference; for all $0<t\leq T$

\begin{split}\varepsilon_{t}&:=z_{t}-\hat{z}_{t}=\tilde{A}z_{t-1}+\tilde{B}w_{t-1}-\hat{z}_{t}\\ &=\tilde{A}z_{t-1}+(I-\tilde{A})\hat{z}_{t-1}-\hat{z}_{t}=\tilde{A}\varepsilon_{t-1}+\hat{z}_{t-1}-\hat{z}_{t}.\end{split}

(18)

Then $\varepsilon_{t}=\tilde{A}^{t}\varepsilon_{0}+\sum_{i=0}^{t-1}\tilde{A}^{i}\left(\hat{z}_{t-i-1}-\hat{z}_{t-i}\right)$ for a given time step $0\leq t\leq T$. Under Assumption 3

\begin{split}\|\varepsilon_{t}\|\leq{}&c_{\tilde{A}}\lambda_{\tilde{A}}^{t}\|\varepsilon_{0}\|\\ &+\left\|\left(I-\tilde{A}\right)^{-1}\tilde{B}\right\|\cdot\sum_{i=0}^{t-1}\lambda_{\tilde{A}}^{i}\left(\|A\|\|\Delta r_{t-i-1}\|+\|\Delta r_{t-i}\|\right).\end{split}

Defining $h=\left\|\left(I-\tilde{A}\right)^{-1}\tilde{B}\right\|\left(\|A\|+1\right)$ , $b=(h+1)\bar{R}+\|x_{0}\|+\|v_{0}\|$ , and $\bar{\varepsilon}=c_{\tilde{A}}\left(b+\frac{2\bar{R}h}{1-\lambda_{\tilde{A}}}\right)$

\|\varepsilon_{t}\|\leq c_{\tilde{A}}b\lambda_{\tilde{A}}^{t}+c_{\tilde{A}}h\sum_{i=0}^{t-1}\lambda_{\tilde{A}}^{i}\left(\|\Delta r_{t-i}\|+\|\Delta r_{t-1-i}\|\right),

(19)

\begin{split}s_{1,t}&\leq\|\varepsilon_{t}\|\leq c_{\tilde{A}}b\lambda_{\tilde{A}}^{t}+c_{\tilde{A}}h\sum_{i=0}^{t}\lambda_{\tilde{A}}^{i}\|\Delta r_{t-i}\|,\\ \sum_{t=0}^{T-1}s_{1,t}&\leq\sum_{t=0}^{T-1}\|\varepsilon_{t}\|\leq\frac{c_{\tilde{A}}}{1-\lambda_{\tilde{A}}}\left(b+hL(T)\right).\end{split}

There exist $s_{2},s_{3}\in\mathbb{R}_{+}$ , such that $s_{2,t}\leq s_{2},\;s_{3,t}\leq s_{3}$ and from (19) $s_{1,t}\leq\bar{\varepsilon}$ for all $t\in\mathbb{N}$ . Using Lemma 4.3 and denoting $\bar{P}=4\|R+B^{\top}PB\|$

\mathcal{R}(w,e_{0})<\bar{P}\sum_{t=0}^{T-1}\left(s_{1,t}^{2}+s_{2,t}^{2}+s_{3,t}^{2}\right)\leq\mathcal{O}\left(1+L(T)\right).

Note that, unlike the regret bound of the FOSS algorithm in [7], the constant multiplying $L(T)$ above does not depend on $\bar{R}$ , but only on system parameters. This implies that the complexity term captures only the relative distance of the references and is not amplified by their upper bounds.

4.2 Steady State Benchmark

Given Theorem 3.3, one can also compare SS-OGD to the steady state optimal solution for each timestep. Consider

\hat{u}_{t}=-K\hat{e}_{t}+\hat{v}_{t}

(20)

for all $0\leq t<T$, where $\hat{e}_{t}$ and $\hat{v}_{t}$ solve (15). This steady state controller can be interpreted as an optimal benchmark that is decoupled from the system dynamics, has access to the current cost $c_{t}$, and hence to the one step ahead reference, $r_{t+1}$, and solves for its optimal, steady state solution. The following lemma provides a side result on the regret of the SS-OGD algorithm with respect to the steady state controller (20), $\mathcal{R}_{\mathrm{SS}}^{\mathrm{SS-OGD}}(w,e_{0}):=J(e_{0},u^{\mathrm{SS-OGD}})-J(e_{0},\hat{u})$.

Lemma 4.5.

Under Assumptions 1, 2, and 3, the regret of the SS-OGD algorithm with respect to the steady state benchmark (20) scales with the reference path length

\mathcal{R}_{\mathrm{SS}}^{\mathrm{SS-OGD}}(w,e_{0})\leq\mathcal{O}\left(1+L(T)\right).

Proof 4.6.

The regret can be expressed as a function of the combined error state $\varepsilon$ evolving according to (18). Defining $\tilde{Q}_{i}:=\tilde{Q}$ for all $0\leq i<T$ and $\tilde{Q}_{T}$ as in (21) in Appendix 7

\mathcal{R}_{\mathrm{SS}}^{\mathrm{SS-OGD}}(w,e_{0})\leq\|\tilde{Q}\|\left(2h\bar{R}+\bar{\varepsilon}\right)\sum_{t=0}^{T}\|\varepsilon_{t}\|,

using (19) and $\|\hat{z}_{t}\|\leq h\bar{R},\;\forall t$ . Then

\mathcal{R}_{\mathrm{SS}}^{\mathrm{SS-OGD}}(w,e_{0})\leq\frac{c_{\tilde{A}}\|\tilde{Q}\|\left(2h\bar{R}+\bar{\varepsilon}\right)}{1-\lambda_{\tilde{A}}}\left(b+hL(T)\right),

using (19), and the Cauchy Product inequality (17).

5 Numerical Example

The SS-OGD algorithm is implemented on a linearized quadrotor model [19] in closed-loop with a PI velocity controller [20], to track a reference trajectory in two dimensions. In particular, we consider the following model

A=\begin{bmatrix}1.000&0&0.096&0&0&0.040\\ 0&1.000&0&0.096&-0.040&0\\ 0&0&0.894&0&0&0.703\\ 0&0&0&0.894&-0.703&0\\ 0&0&0&0.193&0.452&0\\ 0&0&-0.193&0&0&0.452\end{bmatrix},

B=\begin{bmatrix}0.004&0&0.106&0&0&0.193\\ 0&0.004&0&0.106&-0.193&0\end{bmatrix}^{\top},

where the state $x:=\begin{bmatrix}p_{x}&p_{y}&v_{x}&v_{y}&\beta&\rho\end{bmatrix}^{\top}$ contains the horizontal position, velocity, the roll and pitch angles of the quadrotor, and the input $u:=\begin{bmatrix}v_{x}^{t}&v_{y}^{t}\end{bmatrix}^{\top}$ sets the target horizontal velocities. We take $Q=\mathop{\rm diag}\left(100,100,1,1,0,0\right)$ and $R=0.1\cdot I$ .

In the first experiment, the drone tracks the shape of the letters IFA for an a priori unknown reference with a fixed $\Delta r_{t}$ for all timesteps. SS-OGD’s performance is compared to that of the causal CE controller, which minimizes the time-averaged infinite horizon steady state cost with all future references fixed to $r_{t}$, i.e. $\textstyle{r_{i}=r_{t}},\;t<i<T$. This is equivalent to solving (15) with $r_{t+1}=r_{t}$. Unlike the steady state benchmark in (20), the CE controller does not have access to $r_{t+1}$. The results in Figure 1 show that, even though the CE controller appears to track the reference better in the $(p_{x},p_{y})$ plot, the time plot reveals that it lags behind the reference trajectory, resulting in around $3$ times higher regret compared to SS-OGD. As the reference signal has a constant rate of change, the double integrator dynamics of the open loop transfer function from the error to the state allow SS-OGD to achieve perfect position tracking. When this is not the case, SS-OGD again outperforms the CE controller.

Figure 1: Tracking a 2-D shape with a quadrotor model. The horizontal position plot (left panel) shows the apparent better tracking of the CE controller. However, the time plot (top right panel) shows its visible time lag; by contrast SS-OGD quickly converges to the reference. This leads to a lower rate of regret for SS-OGD (bottom right panel).

In a second experiment, the empirical worst-case regret is calculated as a function of $T$. For each $T$, $60$ random reference signals are simulated and the highest value of regret is noted. The references are generated such that $\|\Delta r_{t}\|$ decreases with a constant factor of $0.99$. This ensures a finite path length and therefore a finite regret in the limit, as shown in Figure 2, and in agreement with the upper bound in Theorem 4.1.
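One way such references could be generated is sketched below: random unit directions are scaled by geometrically decaying step sizes, so the path length is a partial geometric sum bounded by its limit. The horizon, dimension, initial step size, and the use of random directions are assumptions; the paper does not specify its generation procedure beyond the decay factor.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, step0, decay = 2000, 2, 1.0, 0.99   # assumed horizon, dimension and initial step

directions = rng.normal(size=(T, n))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)   # random unit directions
steps = step0 * decay ** np.arange(T)                              # ||Delta r_t|| = step0 * 0.99^t
r = np.vstack([np.zeros(n), np.cumsum(directions * steps[:, None], axis=0)])

L = float(np.sum(np.linalg.norm(np.diff(r, axis=0), axis=1)))      # path length L(T)
L_limit = step0 / (1.0 - decay)                                    # limit of the geometric series
```

With these settings the path length stays below `L_limit` for every horizon, which is what keeps the bound of Theorem 4.1 finite as $T$ grows.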

6 Conclusion

In this letter, we reformulate the online LQT problem as an online control problem subject to adversarial disturbances. Within this framework, we propose a novel online gradient descent-based algorithm, called SS-OGD, and show that its dynamic regret scales with the path length of the reference signal. We validate the results on numerical examples with a quadrotor model. The improvement of the regret coefficients, as well as the case where the references are generated by some unknown dynamics, are left for future work.

7 System-Optimizer Dynamics

The combined system-optimizer dynamics matrices are

\tilde{A}=\begin{bmatrix}I-\alpha M&-\alpha H\\ B&A-BK\end{bmatrix},\qquad\tilde{B}=\begin{bmatrix}-2\alpha S^{\top}Q\\ I\end{bmatrix},

where $M:=2\left(S^{\top}QB+(I-KS)^{\top}R\right)$, and $H:=2(S^{\top}Q(A-BK)-(I-KS)^{\top}RK)$. The objective function in (15) can be equivalently written as $z_{t}^{\top}\tilde{Q}z_{t}$ for $0<t<T$ and as $z_{T}^{\top}\tilde{Q}_{T}z_{T}$ for $t=T$, where

\tilde{Q}=\begin{bmatrix}R&-RK\\ -K^{\top}R&Q+K^{\top}RK\end{bmatrix},\qquad\tilde{Q}_{T}=\begin{bmatrix}\boldsymbol{0}_{m\times m}&\boldsymbol{0}_{m\times n}\\ \boldsymbol{0}_{n\times m}&P\end{bmatrix}.

(21)
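
The block structure of $\tilde{Q}$ can be checked by expanding the stage cost. The following worked step assumes that the stacked state is $z_{t}=\begin{bmatrix}v_{t}^{\top}&e_{t}^{\top}\end{bmatrix}^{\top}$, that the control law (8) has the form $u_{t}=-Ke_{t}+v_{t}$, and that the stage cost is $e_{t}^{\top}Qe_{t}+u_{t}^{\top}Ru_{t}$; these are inferred from the structure of $\tilde{A}$ and $\tilde{Q}$ rather than restated from the main text:

e_{t}^{\top}Qe_{t}+(v_{t}-Ke_{t})^{\top}R(v_{t}-Ke_{t})=\begin{bmatrix}v_{t}\\ e_{t}\end{bmatrix}^{\top}\begin{bmatrix}R&-RK\\ -K^{\top}R&Q+K^{\top}RK\end{bmatrix}\begin{bmatrix}v_{t}\\ e_{t}\end{bmatrix}.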

Consider the coordinate transformation $\tilde{A}_{V}:=V\tilde{A}V^{-1}$ with

V=\begin{bmatrix}I&\boldsymbol{0}_{m\times n}\\ -S&I\end{bmatrix},\qquad\tilde{A}_{V}=\begin{bmatrix}I-\alpha\overline{M}&-\alpha H\\ \alpha S\overline{M}&\alpha SH+(A-BK)\end{bmatrix},

with $\overline{M}:=M+HS=2\left(S^{\top}QS+(I-KS)^{\top}R(I-KS)\right)$ positive definite, as shown in Lemma 3.1. By the small gain theorem for interconnected systems [14], the following condition, together with $\alpha<2/\rho(\overline{M})$, is sufficient for the stability of $\tilde{A}_{V}$, and therefore of the dynamics (14):

\alpha\cdot\frac{\|S\overline{M}\|\|H\|}{\lambda_{\min}(\overline{M})}\cdot\max_{w\in\mathbb{R}}\big\|\big[e^{jw}I-\alpha SH-(A-BK)\big]^{-1}\big\|<1.

Since $A-BK$ is stable, there always exists a sufficiently small $\alpha>0$ such that the above is fulfilled.
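
The construction above can be checked numerically. The sketch below is illustrative only: it uses a hypothetical discretised double integrator (not the quadrotor model), takes $K$ as an LQR gain obtained by iterating the Riccati recursion, defines $S:=(I-A+BK)^{-1}B$, and picks the step size `alpha = 1e-3` as an assumed "small" value. It verifies the identity for $\overline{M}$, the closed form of $\tilde{A}_{V}$, and the stability of $\tilde{A}$ for that step size.

```python
import numpy as np

# Hypothetical toy system (assumption): discretised double integrator.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
n, m = B.shape

# LQR gain via fixed-point iteration of the discrete Riccati recursion.
P = Q.copy()
for _ in range(5000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)

Acl = A - B @ K                              # closed-loop matrix A - BK
S = np.linalg.solve(np.eye(n) - Acl, B)      # S = (I - A + BK)^{-1} B
IKS = np.eye(m) - K @ S
M = 2.0 * (S.T @ Q @ B + IKS.T @ R)
H = 2.0 * (S.T @ Q @ Acl - IKS.T @ R @ K)
M_bar = M + H @ S
V = np.block([[np.eye(m), np.zeros((m, n))], [-S, np.eye(n)]])


def A_tilde(alpha):
    """Combined system-optimizer matrix."""
    return np.block([[np.eye(m) - alpha * M, -alpha * H], [B, Acl]])


def A_tilde_V(alpha):
    """Closed form of V @ A_tilde @ inv(V) stated in the text."""
    return np.block([[np.eye(m) - alpha * M_bar, -alpha * H],
                     [alpha * S @ M_bar, alpha * S @ H + Acl]])


def spectral_radius(X):
    return float(np.max(np.abs(np.linalg.eigvals(X))))


if __name__ == "__main__":
    alpha = 1e-3  # assumed small step size
    print("rho(A - BK)   =", spectral_radius(Acl))
    print("rho(A_tilde)  =", spectral_radius(A_tilde(alpha)))
    print("transform ok  =",
          np.allclose(V @ A_tilde(alpha) @ np.linalg.inv(V), A_tilde_V(alpha)))
```

The check is a spot test for one system and one step size; it does not replace the small gain condition, which gives a sufficient range of $\alpha$.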

8 Proof of Theorem 3.3

Given a disturbance vector $w_{t}$ and a bias input $v$, the steady state of the dynamics (3) with the control law (8) is given by $e=Sv+\hat{S}w_{t}$, where $\hat{S}:=(I-A+BK)^{-1}$. Substituting this in the objective function of (15), one can confirm that the $v$ that minimizes that cost is the unique (as shown in Lemma 3.1) solution of $\left(S^{\top}QS+(I-KS)^{\top}R(I-KS)\right)v=\left((I-KS)^{\top}RK\hat{S}-S^{\top}Q\hat{S}\right)w_{t}$. Using the definition of $\tilde{A}$, the steady state $\hat{v}$ of the SS-OGD update (13) for a constant $w_{t}$ solves $0=M\hat{v}+H(S\hat{v}+\hat{S}w_{t})+2S^{\top}Qw_{t}$. Expanding $M$ and $H$ and dividing by $2$ gives

\begin{aligned}0&=S^{\top}Q(I+(A-BK)\hat{S})B\hat{v}+(I-KS)^{\top}R(I-KS)\hat{v}\\ &\quad+S^{\top}Q(I+(A-BK)\hat{S})w_{t}-(I-KS)^{\top}RK\hat{S}w_{t}.\end{aligned}

Since $I+(A-BK)\hat{S}=\hat{S}$ , and $S=\hat{S}B$ , the two equations coincide, leading to the unique steady state solution $\hat{v}$ .
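
The coincidence of the two solutions can also be confirmed numerically. The sketch below uses hypothetical toy matrices and an arbitrary hand-picked stabilising gain $K$ (the algebraic identity does not depend on $K$ being optimal), and assumes the stacked state ordering $z=\begin{bmatrix}v^{\top}&e^{\top}\end{bmatrix}^{\top}$ implied by $\tilde{A}$. It solves the optimality condition for $v$ directly and compares it with the fixed point of $z_{t+1}=\tilde{A}z_{t}+\tilde{B}w$ for a constant $w$.

```python
import numpy as np

# Hypothetical toy data (assumptions); K is a hand-picked stabilising gain.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.1, 0.3]])
Q = np.diag([2.0, 1.0])
R = np.array([[0.5]])
alpha = 0.05                   # any alpha > 0 yields the same fixed point
w = np.array([0.3, -0.7])      # constant disturbance

n, m = B.shape
Acl = A - B @ K
S_hat = np.linalg.inv(np.eye(n) - Acl)       # (I - A + BK)^{-1}
S = S_hat @ B
IKS = np.eye(m) - K @ S
M = 2.0 * (S.T @ Q @ B + IKS.T @ R)
H = 2.0 * (S.T @ Q @ Acl - IKS.T @ R @ K)

# Minimiser of the steady-state cost (first equation in the proof).
lhs = S.T @ Q @ S + IKS.T @ R @ IKS
rhs = (IKS.T @ R @ K @ S_hat - S.T @ Q @ S_hat) @ w
v_opt = np.linalg.solve(lhs, rhs)

# Fixed point of the combined dynamics z = A_tilde z + B_tilde w.
A_tilde = np.block([[np.eye(m) - alpha * M, -alpha * H], [B, Acl]])
B_tilde = np.vstack([-2.0 * alpha * S.T @ Q, np.eye(n)])
z_hat = np.linalg.solve(np.eye(m + n) - A_tilde, B_tilde @ w)
v_hat, e_hat = z_hat[:m], z_hat[m:]

if __name__ == "__main__":
    print("v_opt =", v_opt)
    print("v_hat =", v_hat)
```

Solving for the fixed point with a linear solve, rather than simulating the recursion, keeps the check independent of whether the chosen `alpha` is small enough for convergence.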

References

[1] E. Hazan, S. Kakade, and K. Singh, “The nonstochastic control problem,” in Algorithmic Learning Theory, pp. 408–421, PMLR, 2020.
[2] D. Foster and M. Simchowitz, “Logarithmic regret for adversarial online control,” in International Conference on Machine Learning, pp. 3211–3221, PMLR, 2020.
[3] Y. Abbasi-Yadkori, P. Bartlett, and V. Kanade, “Tracking adversarial targets,” in International Conference on Machine Learning, pp. 369–377, PMLR, 2014.
[4] M. Nonhoff and M. A. Müller, “Online gradient descent for linear dynamical systems,” IFAC-PapersOnLine, vol. 53, no. 2, pp. 945–952, 2020.
[5] M. Nonhoff, J. Köhler, and M. A. Müller, “Online convex optimization for constrained control of linear systems using a reference governor,” arXiv preprint arXiv:2211.09088, 2022.
[6] E. C. Balta, A. Iannelli, R. S. Smith, and J. Lygeros, “Regret analysis of online gradient descent-based iterative learning control with model mismatch,” in 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 1479–1484, IEEE, 2022.
[7] Y. Li, X. Chen, and N. Li, “Online optimal control with linear dynamics and predictions: Algorithms and regret analysis,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[8] A. Hauswirth, S. Bolognani, G. Hug, and F. Dörfler, “Optimization algorithms as robust feedback controllers,” arXiv preprint arXiv:2103.11329, 2021.
[9] L. Cothren, A. M. Ospina, G. Bianchin, and E. Dall’Anese, “Online optimization of linear-time invariant dynamical systems with cost perception,” in 2022 56th Asilomar Conference on Signals, Systems, and Computers, pp. 1357–1361, IEEE, 2022.
[10] A. Karapetyan, A. Tsiamis, E. C. Balta, A. Iannelli, and J. Lygeros, “Implications of regret on stability of linear dynamical systems,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 2583–2588, 2023.
[11] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal control. John Wiley & Sons, 2012.
[12] G. Goel and B. Hassibi, “The power of linear controllers in LQR control,” in 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 6652–6657, IEEE, 2022.
[13] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.
[14] M. Green and D. J. Limebeer, Linear robust control. Courier Corporation, 2012.
[15] S. Kakade and J. Langford, “Approximately optimal approximate reinforcement learning,” in Proceedings of the Nineteenth International Conference on Machine Learning, pp. 267–274, 2002.
[16] R. Zhang, Y. Li, and N. Li, “On the regret analysis of online LQR control with predictions,” in 2021 American Control Conference (ACC), pp. 697–703, IEEE, 2021.
[17] A. Karapetyan, A. Iannelli, and J. Lygeros, “On the regret of $\mathcal{H}_{\infty}$ control,” in 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 6181–6186, IEEE, 2022.
[18] C. Yu, G. Shi, S.-J. Chung, Y. Yue, and A. Wierman, “The power of predictions in online control,” Advances in Neural Information Processing Systems, vol. 33, pp. 1994–2004, 2020.
[19] P. N. Beuchat, “N-rotor vehicles: modelling, control, and estimation,” ETH Zurich Research Collection, 2019.
[20] A. Karapetyan, “Distributed Control of Flying Quadrotors.” https://github.com/akarapet/admm_collision_avoidance, June 2020.

\|\varepsilon_{t}\|\leq c_{\tilde{A}}b\lambda_{\tilde{A}}^{t}+c_{\tilde{A}}h\sum_{i=0}^{t-1}\lambda_{\tilde{A}}^{i}\left(\|\Delta r_{t-i}\|+\|\Delta r_{t-1-i}\|\right),

(19)

s_{1,t}\leq\|\varepsilon_{t}\|\leq c_{\tilde{A}}b\lambda_{\tilde{A}}^{t}+c_{\tilde{A}}h\sum_{i=0}^{t}\lambda_{\tilde{A}}^{i}\|\Delta r_{t-i}\|,

\sum_{t=0}^{T-1}s_{1,t}\leq\sum_{t=0}^{T-1}\|\varepsilon_{t}\|\leq\frac{c_{\tilde{A}}}{1-\lambda_{\tilde{A}}}\left(b+hL(T)\right).
