Introduction
Let $A \in \mathbb{R}^{m\times n}$, $c \in \mathbb{R}^n$ and $b \in\mathbb{R}^m$, and consider a matrix game with payoff function $(x, u) \mapsto \langle Ax, u\rangle + \langle c, x\rangle + \langle b, u\rangle$. The Nash equilibrium problem is \begin{eqnarray} \underset{x \in \Delta_n}{\text{minimize }}\underset{u \in \Delta_m}{\text{maximize }}\langle Ax, u\rangle + \langle c, x\rangle + \langle b, u\rangle. \end{eqnarray} From the "min" player's point of view, this problem can be rewritten as \begin{eqnarray} \underset{x \in \Delta_n}{\text{minimize }}f(x), \end{eqnarray} where the proper convex function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is defined by \begin{eqnarray} f(x) := \hat{f}(x) + \underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u),\hspace{.5em}\hat{f}(x) := \langle c, x\rangle,\hspace{.5em}\hat{\phi}(u) := \langle b, u\rangle. \end{eqnarray} Of course, the dual problem is \begin{eqnarray} \underset{u \in \Delta_m}{\text{maximize }}\phi(u), \end{eqnarray} where the proper concave function $\phi: \mathbb{R}^m \rightarrow \mathbb{R}$ is defined by \begin{eqnarray} \phi(u) := -\hat{\phi}(u) + \underset{x \in \Delta_n}{\text{min }}\langle Ax, u\rangle + \hat{f}(x). \end{eqnarray}

Now, for a positive scalar $\mu$, consider a smoothed version of $f$, \begin{eqnarray} \bar{f}_\mu(x) := \hat{f}(x) + \underbrace{\underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u) - \mu d_2(u)}_{f_\mu(x)}, \end{eqnarray} where $d_2$ is a prox-function for the simplex $\Delta_m$ (see Nesterov's paper for the definition), with prox-center $\bar{u} \in \Delta_m$. For simplicity, take $d_2(u) := \frac{1}{2}\|u - \bar{u}\|^2 \ge 0 = d_2(\bar{u})$ for all $u \in \Delta_m$, with $\bar{u} := (1/m, 1/m, \ldots, 1/m) \in \Delta_m$, and define a prox-function $d_1$ for $\Delta_n$ analogously, with prox-center $\bar{x} := (1/n, 1/n, \ldots, 1/n) \in \Delta_n$. Note that $\sigma_1 = \sigma_2 = 1$, since the $d_j$'s are $1$-strongly convex. Also, since the $d_j$'s attain their maxima at the kinks (vertices) of the respective simplexes, $D_1 := \underset{x \in \Delta_n}{\text{max }}d_1(x)$ equals half the squared distance between any kink and $\bar{x}$, namely \begin{eqnarray} D_1 = \frac{1}{2}\left((1-1/n)^2 + (n-1)(1/n)^2\right) = (1 - 1/n)/2. \end{eqnarray} Similarly, $D_2 := \underset{u \in \Delta_m}{\text{max }}d_2(u) = (1 - 1/m)/2$. Finally, note that the functions $(\bar{f}_\mu)_{\mu > 0}$ increase pointwise to $f$ in the limit $\mu \rightarrow 0^+$.
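To make the setup concrete, here is a minimal Python sketch instantiating a random game and the quantities just defined; the sizes $m$, $n$ and the random data are assumptions for illustration only, not part of the development above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 80                       # hypothetical problem sizes
A = rng.standard_normal((m, n))     # random payoff matrix (illustration only)
b = rng.standard_normal(m)
c = rng.standard_normal(n)

# Prox-centers and prox-function maxima, per the formulas above
x_bar = np.full(n, 1.0 / n)         # prox-center of Delta_n
u_bar = np.full(m, 1.0 / m)         # prox-center of Delta_m
D1 = (1.0 - 1.0 / n) / 2.0          # D_1 = max of d_1 over Delta_n
D2 = (1.0 - 1.0 / m) / 2.0          # D_2 = max of d_2 over Delta_m
```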
Ingredients for implementing the algorithm
Let us now prepare the ingredients necessary for implementing the algorithm.

Definition. Given a closed convex subset $C \subseteq \mathbb{R}^m$ and a point $x \in \mathbb{R}^m$, recall the definition of the orthogonal projection of $x$ onto $C$: \begin{eqnarray} proj_C(x) := \underset{z \in C}{\text{argmin }}\frac{1}{2}\|z-x\|^2. \end{eqnarray} Geometrically, $proj_C(x)$ is the unique point of $C$ which is closest to $x$, and so $proj_C$ is a well-defined function from $\mathbb{R}^m$ to $C$. For example, if $C$ is the nonnegative orthant $\mathbb{R}_+^m$, then $proj_C(x) \equiv (x)_+$, the component-wise maximum of $x$ and $0$.

Step 1: Call to the oracle, or computation of $\bar{f}_\mu(x_k)$ and $\nabla \bar{f}_\mu(x_k)$. For any $x \in \mathbb{R}^n$, define $v_\mu(x) := (Ax - b)/\mu$ and $u_\mu(x) := proj_{\Delta_m}(\bar{u} + v_\mu(x))$. Completing the square in $u$, one computes \begin{eqnarray*} \begin{split} f_\mu(x) := \underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u) - \mu d_2(u) &= -\underset{u \in \Delta_m}{\text{min }}\frac{1}{2}\mu\|u-\bar{u}\|^2 + \langle b - Ax, u\rangle\\ &= \langle Ax - b, \bar{u}\rangle + \frac{1}{2}\mu\|v_\mu(x)\|^2 - \underset{u \in \Delta_m}{\text{min }}\frac{1}{2}\mu\|u - (\bar{u} + v_\mu(x))\|^2. \end{split} \end{eqnarray*} One recognizes the minimization problem on the right-hand side as the orthogonal projection of the point $\bar{u} + v_\mu(x)$ onto the simplex $\Delta_m$, so the maximizer is precisely $u_\mu(x)$. It is not difficult to show that $f_\mu$ is smooth with gradient $\nabla f_\mu: x \mapsto A^Tu_\mu(x)$ (one can invoke Danskin's theorem, for example), and that $L_\mu := \frac{1}{\mu\sigma_2}\|A\|^2$ is a Lipschitz constant for the latter. Thus, $\bar{f}_\mu$ is smooth and one computes \begin{eqnarray} \nabla \bar{f}_\mu(x_k) = \nabla \hat{f}(x_k) + \nabla f_\mu(x_k) = c + A^Tu_\mu(x_k). \end{eqnarray} Moreover, $L_\mu$ defined above is also a Lipschitz constant for $\nabla \bar{f}_\mu$.
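Unlike the orthant, the simplex projection has no component-wise closed form, but it can be computed exactly in $O(m \log m)$ time by the classical sort-and-threshold routine (Held–Wolfe–Crowder; see also Duchi et al.). Below is a minimal sketch of that projection together with the Step 1 oracle; the function names are mine, and the sort-based routine is the standard one, not something spelled out in this post.

```python
import numpy as np

def proj_simplex(v):
    """Orthogonal projection of v onto the probability simplex
    {z : z >= 0, sum(z) = 1}, via the standard sort-and-threshold method."""
    p = v.size
    u = np.sort(v)[::-1]                     # sort entries in decreasing order
    css = np.cumsum(u)
    # largest index rho with u[rho] - (css[rho] - 1)/(rho + 1) > 0
    rho = np.nonzero(u * np.arange(1, p + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)   # optimal shift
    return np.maximum(v - theta, 0.0)

def oracle(A, b, c, x, u_bar, mu):
    """Step 1: compute u_mu(x) and the gradient of fbar_mu at x."""
    v = (A @ x - b) / mu                     # v_mu(x)
    u = proj_simplex(u_bar + v)              # u_mu(x)
    grad = c + A.T @ u                       # gradient of fbar_mu at x
    return u, grad
```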
Step 2: Computation of $y_k$. Completing the square in $y$ and ignoring constant terms, one obtains the update rule \begin{eqnarray} y_k = T_{\Delta_n}(x_k) := \underset{y \in \Delta_n}{\text{argmin }}\langle \nabla \bar{f}_\mu(x_k), y\rangle + \frac{1}{2}L_\mu\|y-x_k\|^2 = proj_{\Delta_n}\left(x_k - \frac{1}{L_\mu}\nabla \bar{f}_\mu (x_k)\right). \end{eqnarray}
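In code, Step 2 is a single projected-gradient step, reusing the proj_simplex sketch above:

```python
def step2(x, grad, L_mu):
    """Step 2: y_k = proj_{Delta_n}(x_k - grad / L_mu)."""
    return proj_simplex(x - grad / L_mu)
```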
Step 3: Computation of $z_k$. Similarly, for any $s \in \mathbb{R}^n$, one has \begin{eqnarray*} \underset{x \in \Delta_n}{\text{argmin }}L_\mu d_1(x) + \langle s, x\rangle = \underset{x \in \Delta_n}{\text{argmin }}\frac{1}{2}L_\mu\|x-\bar{x}\|^2 + \langle s, x\rangle = proj_{\Delta_n}\left(\bar{x} - \frac{1}{L_\mu}s\right), \end{eqnarray*} by completing the square in $x$ and ignoring constant terms. Thus, with $s = \sum_{i=0}^k\frac{i+1}{2}\nabla \bar{f}_\mu(x_i)$, we obtain the update rule \begin{eqnarray} z_k = proj_{\Delta_n}\left(\bar{x} - \frac{1}{L_\mu}\sum_{i=0}^k\frac{i+1}{2}\nabla \bar{f}_\mu(x_i)\right). \end{eqnarray}

Stopping criterion. The algorithm is stopped once the primal-dual gap \begin{eqnarray} \begin{split} f(x_k) - \phi(u_\mu(x_k)) &= \left(\langle c, x_k \rangle + \underset{u \in \Delta_m}{\text{max }}\langle Ax_k, u\rangle - \langle b, u\rangle\right) - \left(-\langle b, u_\mu(x_k)\rangle + \underset{z \in \Delta_n}{\text{min }}\langle Az, u_\mu(x_k)\rangle + \langle c, z\rangle\right)\\ &= \langle c, x_k \rangle + \underset{1 \le j \le m}{\text{max }}(Ax_k-b)_j + \langle b, u_\mu(x_k)\rangle - \underset{1 \le i \le n}{\text{min }}(A^Tu_\mu(x_k) + c)_i \end{split} \end{eqnarray} falls below a tolerance $\epsilon > 0$ (say $\epsilon \sim 10^{-4}$). Note that in the above computation, we have used the fact that a linear function attains its maximum over a simplex at a vertex: for any $r \in \mathbb{R}^m$, \begin{eqnarray} \begin{split} \underset{u \in \Delta_m}{\text{max }}\langle r, u\rangle &= \text{max of }\langle r, \cdot\rangle\text{ on the boundary of }\Delta_m \\ &= \text{max of }\langle r, \cdot\rangle\text{ on the line segment joining some pair of distinct kinks of } \Delta_m \\ &= \underset{1 \le i < j \le m}{\text{max }}\text{ }\underset{0 \le t \le 1}{\text{max }}\,tr_i + (1-t)r_j = \underset{1 \le i \le m}{\text{max }}r_i. \end{split} \end{eqnarray}
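Correspondingly, Step 3 and the stopping criterion might look as follows; here s is the running weighted sum of gradients, which the caller accumulates across iterations:

```python
def step3(x_bar, s, L_mu):
    """Step 3: z_k = proj_{Delta_n}(x_bar - s / L_mu),
    where s = sum_{i <= k} (i+1)/2 * grad_i."""
    return proj_simplex(x_bar - s / L_mu)

def primal_dual_gap(A, b, c, x, u):
    """f(x) - phi(u), evaluated via the coordinate-wise max/min identities."""
    return (c @ x + np.max(A @ x - b)) + (b @ u - np.min(A.T @ u + c))
```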
Algorithm parameters. In accordance with Nesterov's recommendation, we may take \begin{eqnarray} \mu = \frac{\epsilon}{2D_2} = \frac{\epsilon}{1 - 1/m}, \text{ and }L_\mu=\frac{\|A\|^2}{\sigma_2 \mu} = \frac{\|A\|^2}{\mu}. \end{eqnarray}
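Putting the pieces together, a minimal driver might read as below, reusing the sketches from the previous steps. Note that the convex-combination update $x_{k+1} = \frac{2}{k+3}z_k + \frac{k+1}{k+3}y_k$ is taken from Nesterov's smoothing paper (it is not restated in this post), and $\|A\|$ denotes the spectral norm.

```python
def smoothed_game_solver(A, b, c, eps=1e-4, max_iter=100000):
    """Sketch of Nesterov's smoothing scheme for the matrix game above."""
    m, n = A.shape
    x_bar = np.full(n, 1.0 / n)                  # prox-center of Delta_n
    u_bar = np.full(m, 1.0 / m)                  # prox-center of Delta_m
    D2 = (1.0 - 1.0 / m) / 2.0
    mu = eps / (2.0 * D2)                        # Nesterov's recommendation
    L_mu = np.linalg.norm(A, 2) ** 2 / mu        # sigma_2 = 1
    x = x_bar.copy()
    s = np.zeros(n)                              # running weighted gradient sum
    for k in range(max_iter):
        u, grad = oracle(A, b, c, x, u_bar, mu)  # Step 1
        gap = primal_dual_gap(A, b, c, x, u)
        if gap <= eps:                           # stopping criterion
            break
        y = step2(x, grad, L_mu)                 # Step 2
        s += 0.5 * (k + 1) * grad
        z = step3(x_bar, s, L_mu)                # Step 3
        x = 2.0 * z / (k + 3) + (k + 1.0) * y / (k + 3)  # Nesterov's combination
    return x, u, gap
```

For instance, calling smoothed_game_solver(A, b, c) on the random game sketched in the introduction should return an approximate equilibrium whose primal-dual gap is below eps.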