
Monday, June 29, 2015

Details of Nesterov smoothing scheme for Nash equilibria in matrix games

Motivated by a recent question on StackExchange, the goal of this post is to clarify some of the technical details of Nesterov's smoothing scheme, in the particular case of computing approximate Nash equilibria in matrix games. This post is not novel; it simply strives to help familiarize the reader with Nesterov's smoothing scheme for minimizing non-smooth functions.

Introduction


Let A \in \mathbb{R}^{m\times n}, c \in \mathbb{R}^n and b \in\mathbb{R}^m, and consider a matrix game with payoff function (x, u) \mapsto \langle Ax, u\rangle + \langle c, x\rangle + \langle b, u\rangle. The Nash equilibrium problem is \begin{eqnarray} \underset{x \in \Delta_n}{\text{minimize }}\underset{u \in \Delta_m}{\text{maximize }}\langle Ax, u\rangle + \langle c, x\rangle + \langle b, u\rangle. \end{eqnarray} From the "min" player's point of view, this problem can be re-written as \begin{eqnarray} \underset{x \in \Delta_n}{\text{minimize }}f(x), \end{eqnarray} where the proper convex function f: \mathbb{R}^n \rightarrow \mathbb{R} is defined by \begin{eqnarray} f(x) := \hat{f}(x) + \underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u),\hspace{.5em}\hat{f}(x) := \langle c, x\rangle,\hspace{.5em}\hat{\phi}(u) := \langle b, u\rangle. \end{eqnarray} Of course, the dual problem is \begin{eqnarray} \underset{u \in \Delta_m}{\text{maximize }}\phi(u), \end{eqnarray} where the proper concave function \phi: \mathbb{R}^m \rightarrow \mathbb{R} is defined by \begin{eqnarray} \phi(u) := -\hat{\phi}(u) + \underset{x \in \Delta_n}{\text{min }}\langle Ax, u\rangle + \hat{f}(x). \end{eqnarray} Now, for a positive scalar \mu, consider a smoothed version of f \begin{eqnarray} \bar{f}_\mu(x) := \hat{f}(x) + \underbrace{\underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u) - \mu d_2(u)}_{f_\mu(x)}, \end{eqnarray}
where d_2 is a prox-function (see Nesterov's paper for the definition) for the simplex \Delta_m, with prox-center \bar{u} \in \Delta_m. For simplicity, take d_2(u) := \frac{1}{2}\|u - \bar{u}\|^2 \ge 0 = d_2(\bar{u}), \forall u \in \Delta_m, with \bar{u} := (1/m,1/m,\ldots,1/m) \in \Delta_m. Define a prox-function d_1 for \Delta_n analogously, with prox-center \bar{x} := (1/n,1/n,\ldots,1/n) \in \Delta_n. Note that \sigma_1 = \sigma_2 = 1, since the d_j's are 1-strongly convex. Also, noting that the d_j's attain their maximum at the kinks (vertices) of the respective simplexes, one computes \begin{eqnarray} D_1 := \underset{x \in \Delta_n}{\text{max }}d_1(x) = \frac{1}{2} \times \text{squared distance between any kink and } \bar{x} = \frac{1}{2}\left((1-1/n)^2 + (n-1)(1/n)^2\right) = (1 - 1/n)/2. \end{eqnarray} Similarly, D_2 := \underset{u \in \Delta_m}{\text{max }}d_2(u) = (1 - 1/m)/2. Finally, note that the functions (\bar{f}_\mu)_{\mu > 0} increase point-wise to f in the limit \mu \rightarrow 0^+.
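To make the computations below concrete, here is a minimal Python setup. The problem data A, b, c and the dimensions are placeholders invented for illustration, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem data (placeholders, not taken from the post).
m, n = 50, 80
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
c = rng.standard_normal(n)

# Prox-centers: the barycenters of the two simplexes.
x_bar = np.full(n, 1.0 / n)
u_bar = np.full(m, 1.0 / m)

# Strong-convexity constants of the quadratic prox-functions d_1, d_2.
sigma1 = sigma2 = 1.0

# D_1 = max_{x in Delta_n} d_1(x) and D_2 = max_{u in Delta_m} d_2(u).
D1 = (1.0 - 1.0 / n) / 2.0
D2 = (1.0 - 1.0 / m) / 2.0
```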



Ingredients for implementing the algorithm

Let us now prepare the ingredients necessary for implementing the algorithm. Definition. Given a closed convex subset C \subseteq \mathbb{R}^m and a point x \in \mathbb{R}^m, recall the definition of the orthogonal projection of x onto C \begin{eqnarray} proj_C(x) := \underset{z \in C}{\text{argmin }}\frac{1}{2}\|z-x\|^2. \end{eqnarray} Geometrically, proj_C(x) is the unique point of C closest to x, and so proj_C is a well-defined function from \mathbb{R}^m to C. For example, if C is the nonnegative orthant \mathbb{R}_+^m, then proj_C(x) \equiv (x)_+, the component-wise maximum of x and 0.
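The algorithm below repeatedly projects onto a probability simplex. The post does not spell out how to compute this projection; one standard option (my choice, not the author's) is the classical sort-and-threshold routine, sketched here:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {z : z >= 0, sum(z) = 1}, via the classical sort-and-threshold rule."""
    k = len(v)
    u = np.sort(v)[::-1]                        # sort in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, k + 1)
    # Largest index rho such that u_rho - (sum_{i<=rho} u_i - 1)/rho > 0.
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)
```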
Step 1: Call to the oracle, i.e. computation of \bar{f}_\mu(x_k) and \nabla \bar{f}_\mu(x_k). For any x \in \mathbb{R}^n, define v_\mu(x) := (Ax - b) / \mu and u_\mu(x) := proj_{\Delta_m}(\bar{u} + v_\mu(x)). One computes \begin{eqnarray*} \begin{split} f_\mu(x) := \underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u) - \mu d_2(u) &= -\underset{u \in \Delta_m}{\text{min }}\frac{1}{2}\mu\|u-\bar{u}\|^2 + \langle b - Ax, u\rangle\\ &= \langle Ax - b, \bar{u}\rangle + \frac{1}{2}\mu\|v_\mu(x)\|^2 - \underset{u \in \Delta_m}{\text{min }}\frac{1}{2}\mu\|u - (\bar{u} + v_\mu(x))\|^2, \end{split} \end{eqnarray*} by completing the square in u. One recognizes the minimization problem on the right-hand side as the orthogonal projection of the point \bar{u} + v_\mu(x) onto the simplex \Delta_m, whose solution is precisely u_\mu(x). It is not difficult to show that f_\mu is smooth with gradient \nabla f_\mu: x \mapsto A^Tu_\mu(x) (one can invoke Danskin's theorem, for example), and that L_\mu := \frac{1}{\mu\sigma_2}\|A\|^2 is a Lipschitz constant for this gradient, where \|A\| denotes the operator norm of A. Thus, \bar{f}_\mu is smooth and one computes \begin{eqnarray} \nabla \bar{f}_\mu(x_k) = \nabla \hat{f}(x_k) + \nabla f_\mu(x_k) = c + A^Tu_\mu(x_k). \end{eqnarray} Moreover, L_\mu defined above is also a Lipschitz constant for \nabla \bar{f}_\mu.
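Here is a minimal sketch of this oracle in Python, reusing project_simplex and the placeholder data above (smoothed_oracle is a name I introduce for illustration):

```python
def smoothed_oracle(x, A, b, c, u_bar, mu):
    """Return u_mu(x), the smoothed value bar{f}_mu(x) and its gradient."""
    v = (A @ x - b) / mu                    # v_mu(x)
    u = project_simplex(u_bar + v)          # u_mu(x)
    f_mu = (A @ x - b) @ u - 0.5 * mu * np.sum((u - u_bar) ** 2)  # f_mu(x)
    value = c @ x + f_mu                    # bar{f}_mu(x)
    grad = c + A.T @ u                      # gradient of bar{f}_mu at x
    return u, value, grad
```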
Step 3: Computation of z_k. For any s \in \mathbb{R}^n, one has \begin{eqnarray*} \underset{x \in \Delta_n}{\text{argmin }}L_\mu d_1(x) + \langle s, x\rangle = \underset{x \in \Delta_n}{\text{argmin }}\frac{1}{2}L_\mu\|x-\bar{x}\|^2 + \langle s, x\rangle = proj_{\Delta_n}\left(\bar{x} - \frac{1}{L_\mu}s\right), \end{eqnarray*} by completing the square in x and ignoring constant terms. Thus, with s = \sum_{i=0}^k\frac{i+1}{2}\nabla \bar{f}_\mu(x_i), we obtain the update rule \begin{eqnarray} z_k = proj_{\Delta_n}\left(\bar{x} - \frac{1}{L_\mu}\sum_{i=0}^k\frac{i+1}{2}\nabla \bar{f}_\mu(x_i)\right). \end{eqnarray}
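In code, one can maintain a running weighted sum of the gradients and project once per iteration (a sketch; grad_sum is assumed to hold \sum_{i=0}^k\frac{i+1}{2}\nabla \bar{f}_\mu(x_i)):

```python
def update_z(grad_sum, x_bar, L_mu):
    """z_k = proj_{Delta_n}(x_bar - grad_sum / L_mu)."""
    return project_simplex(x_bar - grad_sum / L_mu)
```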
Step 2: Computation of y_k. Similarly to the previous computation, one obtains the update rule \begin{eqnarray} y_k = T_{\Delta_n}(x_k) := \underset{y \in \Delta_n}{\text{argmin }}\langle \nabla \bar{f}_\mu(x_k), y\rangle + \frac{1}{2}L_\mu\|y-x_k\|^2 = proj_{\Delta_n}\left(x_k - \frac{1}{L_\mu}\nabla \bar{f}_\mu (x_k)\right). \end{eqnarray} Stopping criterion. The algorithm is stopped once the primal-dual gap \begin{eqnarray} \begin{split} &f(x_k) - \phi(u_\mu(x_k))=\\ &\left(\langle c, x_k \rangle + \underset{u \in \Delta_m}{\text{max }}\langle Ax_k, u\rangle - \langle b, u\rangle\right) - \left(-\langle b, u_\mu(x_k)\rangle + \underset{z \in \Delta_n}{\text{min }}\langle Az, u_\mu(x_k)\rangle + \langle c, z\rangle\right)\\ &= \langle c, x_k \rangle + \underset{1 \le j \le m}{\text{max }}(Ax_k-b)_j + \langle b, u_\mu(x_k)\rangle - \underset{1 \le i \le n}{\text{min }}(A^Tu_\mu(x_k) + c)_i \end{split} \end{eqnarray} falls below a tolerance \epsilon > 0 (say \epsilon \sim 10^{-4}). Note that in the above computation, we have used the fact that a linear function attains its maximum over \Delta_m at one of its kinks: for any r \in \mathbb{R}^m, \begin{eqnarray} \begin{split} \underset{u \in \Delta_m}{\text{max }}\langle r, u\rangle &= \text{max of }\langle r, \cdot\rangle\text{ on the boundary of }\Delta_m \\ &= \text{max of }\langle r, \cdot\rangle\text{ over the line segments joining pairs of distinct kinks of } \Delta_m \\ &= \underset{1 \le i < j \le m}{\text{max }}\text{ }\underset{0 \le t \le 1}{\text{max }}\,tr_i + (1-t)r_j = \underset{1 \le i \le m}{\text{max }}r_i, \end{split} \end{eqnarray} and similarly for the minimum over \Delta_n.
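Both the gradient step and the stopping criterion translate directly; a sketch, again reusing the helpers above:

```python
def update_y(x, grad, L_mu):
    """y_k = proj_{Delta_n}(x_k - grad / L_mu), the gradient-mapping step."""
    return project_simplex(x - grad / L_mu)

def primal_dual_gap(x, u, A, b, c):
    """f(x) - phi(u) for feasible x in Delta_n and u in Delta_m."""
    primal = c @ x + np.max(A @ x - b)      # f(x)
    dual = -(b @ u) + np.min(A.T @ u + c)   # phi(u)
    return primal - dual
```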
Algorithm parameters. In accord with Nesterov's recommendation, we may take \begin{eqnarray} \mu = \frac{\epsilon}{2D_2} = \frac{\epsilon}{1 - 1/m}, \text{ and }L_\mu=\frac{\|A\|^2}{\sigma_2 \mu} = \frac{\|A\|^2}{\mu}. \end{eqnarray}
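Putting the pieces together, here is a minimal driver loop. The combination step x_{k+1} = \frac{2}{k+3}z_k + \frac{k+1}{k+3}y_k is the one used in Nesterov's scheme and is not spelled out in this post, so the snippet should be read as a sketch rather than a faithful reproduction:

```python
def nesterov_smoothing(A, b, c, eps=1e-4, max_iter=20000):
    """Sketch of the smoothing scheme for the matrix-game min-max problem."""
    m, n = A.shape
    x_bar = np.full(n, 1.0 / n)
    u_bar = np.full(m, 1.0 / m)
    D2 = (1.0 - 1.0 / m) / 2.0
    mu = eps / (2.0 * D2)                        # smoothing parameter
    L_mu = np.linalg.norm(A, 2) ** 2 / mu        # Lipschitz constant (spectral norm)

    x = x_bar.copy()
    grad_sum = np.zeros(n)                       # accumulates (i+1)/2 * grad_i
    for k in range(max_iter):
        u, _, grad = smoothed_oracle(x, A, b, c, u_bar, mu)
        if primal_dual_gap(x, u, A, b, c) <= eps:
            return x, u                          # eps-approximate equilibrium
        y = update_y(x, grad, L_mu)
        grad_sum += 0.5 * (k + 1) * grad
        z = update_z(grad_sum, x_bar, L_mu)
        # Convex combination used in Nesterov's scheme (assumed, not from the post).
        x = 2.0 / (k + 3) * z + (k + 1.0) / (k + 3) * y
    u, _, _ = smoothed_oracle(x, A, b, c, u_bar, mu)
    return x, u                                  # budget exhausted
```

If the gap test triggers, the returned pair (x, u) is an \epsilon-approximate Nash equilibrium of the matrix game.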

Limitations and extensions

There are a number of conceptual problems with this smoothing scheme. For example, how do we select the smoothing parameter \mu? If it is too small, the scheme becomes ill-conditioned (the Lipschitz constant L_\mu blows up, etc.). If it is too large, the approximation is very crude. A possible workaround is to start with a reasonably large value of \mu and decrease it gradually across iterations (as was done in Algorithm 1 of this paper, for example). In fact, Nesterov himself has proposed the "Excessive Gap Technique" (google it) as a more principled way to go about smoothing. EGT has been used by Andrew Gilpin as an alternative way to solve incomplete-information two-person zero-sum sequential games, like Texas Hold'em (in such games, the players' strategy profiles are no longer simplexes but more general polyhedra).
