CAUTION: Blog moved to dohmatob.github.io !!!: 2015-06-28

Tuesday, June 30, 2015

Exponentially fast computation of Nash equilibria in matrix games: a weird inequality for polyhedral convex functions

Introduction: General considerations

Example of bounded polyhedron, aka polytope

Definition. A proper convex function (the qualificative "proper" means the function is not identically equal to $+\infty$) $g: \mathbb{R}^p \rightarrow [-\infty, +\infty]$ is said to be polyhedral if its epigraph $\mathrm{epi}(g) := \{(x, t) \in \mathbb{R}^{p+1} | t \ge g(x)\}$ is a (convex) polyhedron.

We have the following useful characterization (see [section 2, Murota et al. 2013]): $g$ is polyhedral iff there exists a matrix $A = (a_i) \in \mathbb{R}^{m \times n}$ and a vector $c = (c_i) \in \mathbb{R}^m$ s.t $g(x) = \text{max}\{\langle a_i, x \rangle + c_i| 1 \le i \le m\}$, $\forall x \in dom(g)$. As usual, the set $dom(g) := \{x \in \mathbb{R}^p | g(x) < +\infty\}$ is the effective domain of $g$, and is non-empty since $g$ is proper, by assumption. Given a compact convex subset $C \subseteq dom(g)$, let $\mathrm{Opt}_C(g) \subseteq C$ be the set of minimizers of $g$ on $C$. When there is no confusion about the set $C$, we will omit the subscript and simply write $\mathrm{Opt}(g)$. Note that $\mathrm{Opt}_C(g)$ is itself convex.

Minimax for matrix games and the weird inequality

It's not hard to see that the Nash equilibrium problem for a matrix game with payoff matrix $A \in \mathbb{R}^{m \times n}$
\begin{eqnarray}
\underset{x \in \Delta_m}{\text{minimize }}\underset{y \in \Delta_n}{\text{maximize }}\langle Ax, y\rangle
\end{eqnarray}
can be written as
\begin{eqnarray}
\underset{(x,y) \in \Delta_m \times \Delta_n}{\text{minimize }}G(x, y),
\end{eqnarray}
where
\begin{eqnarray}
\begin{split}
G(x, y) & := \underset{v \in \Delta_n}{\text{max }}\langle Ax, v\rangle - \underset{u \in \Delta_n}{\text{min }}\langle Au, y\rangle = max\{\langle Ax, v\rangle - \langle Au, y\rangle | (u,v) \in \Delta_m \times \Delta_n\} \\
&= \text{max}\{\langle r_i, [x\text{ }y]^T \rangle | 1 \le i \le m + n\}
\end{split}
\end{eqnarray} is the primal-dual gap function and $(r_i) = R := \begin{bmatrix}0\hspace{1.5em}A\\-A^T\hspace{.4em}0\end{bmatrix} \in \mathbb{R}^{(m+n)^2}$. Thus $G$ is a polyhedral convex function on $\Delta_m \times \Delta_n$. Note that $G(x, y) \ge 0$ for every pair of strategies $(x, y) \in \Delta_m \times \Delta_n$, with equality iff $(x,y) \in \mathrm{Opt}(G)$. Also note that the elements of $\mathrm{Opt}(G)$ are simply the Nash equilibria of the game.

Definition. For a given tolerance $\epsilon > 0$, we call a pair of strategies $(x^*,y^*) \in \Delta_m \times \Delta_n$ a Nash $\epsilon$-equilibrium if $G(x^*, y^*) \le \epsilon$.

A remarkable result by [Andrew Gilpin et al.] (Lemma 3) says that for any matrix game, there exists a scalar $\delta > 0$ such that
\begin{eqnarray} \mathrm{dist}((x, y), \mathrm{Opt}(G)) \le \dfrac{G(x, y)}{\delta}, \forall (x, y) \in \Delta_m \times \Delta_n, \label{eq:weirdo}
\end{eqnarray}

where $\mathrm{dist}((x, y), \mathrm{Opt}(G)) := \mathrm{min}\{\|(u,v) - (x,y)\|| (u, v) \in \mathrm{Opt}(G)\}$ is the distance of $(x, y)$ from $\mathrm{Opt}(G)$. Using this bound in conjunction with Nesterov smoothing, the authors derived an $\mathcal{O}\left(\left(\|A\| / \delta\right) ln\left(1 / \epsilon\right)\right)$ algorithm for computing a Nash $\epsilon$-equilibrium. Note however that, the suprimum of the $\delta$s veryfin \eqref{eq:weirdo} can be arbitrarily small in fact, we can explicitly construct games for which it is arbitrarily close to $0$), and so the exponential speed may not be experienced empirically on some ill-conditioned games. To be continued...

Monday, June 29, 2015

PRNI & OHBM 2015

PRNI 2015 @ stanford was great! Great keynotes (Candès, Tibshirani, etc.) + great place. OHBM in Honolulu, well the idea alone seams surreal. Hawaii is marvelous. Great conference too.

These are slides for my

Here are some photos.

The future of screening

Feature-screening à la [Dohmatob et al. PRNI2015]

In machine-learning, feature-screening aims at detecting and eliminating irrelevant (non-predictive) features thus reducing the size of the underly- ing optimization problem (here problem \eqref{eq:primal}). The general idea is to compute for each value of the regularization parameter, a relevance measure for each feature, which is then compared with a threshold (produced by the screening procedure itself). Features which fall short of this threshold are detected as irrelevant and eliminated. This post presents an overview of so-called dynamic screening rules, a new generation of safe screening rules for the Lasso (and related models like the Group-Lasso) which have appeared recently in the literature (for example, [Bonnefoy et al. 2014]). A particular emphasis is put on the novel duality-gap-based screening rules due to Gramfort and co-authors.

We also present resent work on heuristc univariate screening [Dohmatob et al. PRNI2015].

Notation

For a vector $v \in \mathbb{R}^p$, we recall the definition of its

$l_1$-norm $\|v\|_1 := \sum_{1 \le j \le p}|v_j|$,
$l_2$-norm $\|v\|_2 := (\sum_{1 \le j \le p}|v_j|^2)^{1/2}$, and
$l_{\infty}$-norm $\|v\|_{\infty} := \max_{1 \le j \le p}|v_j|$.

The transpose of a matrix $A \in \mathbb{R}^{n \times p}$ is denoted $A^T$. The $i$th row of $A$ is denoted $A_i$. The $j$th column of $A$ is the $j$th row of $A^T$, namely $A^T_j$. Finally, define the bracket \begin{equation} \left[a\right]_c^b := min(max(a, b), c) \end{equation}

General considerations

Consider a Lasso model with response vector $y \in \mathbb{R}^p$ and design matrix $X \in \mathbb{R}^{n \times p}$. Let $\lambda$ (with $0 < \lambda < \|X^Ty\|_{\infty}$) be the regularization parameter. The primal objective to be minimized as a function of the primal variable $\beta \in \mathbb{R}^p$ ($\beta$ is the vector of regressor coefficients) is \begin{eqnarray} p_{\lambda}(\beta) := \frac{1}{2}\|X\beta - y\|^2_2 + \lambda\|\beta\|_1 \label{eq:primal} \end{eqnarray} In the sense of Fenchel-Rockafellar, the dual objective to be maximized as a function of the dual variable $\theta \in \mathbb{R}^n$ is \begin{eqnarray} d_{\lambda}(\theta) := \begin{cases} \frac{1}{2}\|y\|^2_2 - \frac{\lambda^2}{2}\|\theta - \frac{y}{\lambda}\|^2_2, &\mbox{if } \|X^T\theta\|_{\infty} \le 1\\ -\infty, &\mbox{otherwise}. \end{cases} \label{eq:dual} \end{eqnarray} Finally, the duality-gap $\delta_{\lambda}(\beta, \theta)$ at $(\beta, \theta)$ is defined by \begin{eqnarray} \delta_{\lambda}(\beta, \theta) := p_{\lambda}(\beta) - d_{\lambda}(\theta). \label{eq:dgap} \end{eqnarray} One notes the following straightforward facts

$\delta_{\lambda}(\beta, \theta) \ge 0$ with equality iff $\beta$ minimizes \eqref{eq:primal} and $\theta$ maximizes \eqref{eq:dual}. Such a primal-dual pair is called an optimal pair.

The dual objective $d_{\lambda}$ defined in \eqref{eq:dual} has a unique minimizer $\theta_{\lambda}^*$ which corresponds to the euclidean projection of $y/\lambda$ onto the dual-feasible polyhedron \begin{eqnarray} \mathcal{P} := \{\theta \in \mathbb{R}^n | \|X^T\theta\|_{\infty} \le 1\}. \end{eqnarray} Note that $\mathcal{P}$ is compact and convex.

Given an optimal primal-dual pair $(\beta^*_{\lambda}, \theta^*_{\lambda}) \in \mathbb{R}^p \times \mathcal{P}$, we have the fundamental safe (i.e screening rules which provably can't mistakenly discard active features.) screening rule \begin{eqnarray} |X^T_j\theta^*_{\lambda}| < 1 \implies \beta_{\lambda,j}^* = 0\text{ (i.e the $j$th feature is irrelevant)}. \label{eq:fundamental} \end{eqnarray}

The inequality \eqref{eq:fundamental} allows the possibility to envisage constructing safe screening rules as follows:

Construct a ``small'' compact set $C \subseteq \mathbb{R}^n$, containing the dual optimal point $\theta_{\lambda}^*$ with $C \cap \mathcal{P} \ne \emptyset$, such that the value maximum value $m_{C,j} := \underset{\theta \in C}{max}\text{ }|X^T_j\theta|$ can be easily computed.
Noting that $m_{C,j}$ is an upper bound for $|X^T_j\theta^*_{\lambda}|$ in \eqref{eq:fundamental}, we may then discard all features $j$ for which $m_{C,j} < 1$.

The rest of this manuscript overviews methods for effectively realizing such a construction.

Safe sphere tests

We start with the following lemma which provides a useful formula for the duality-gap $\delta_{\lambda}(\beta, \theta)$ defined in \eqref{eq:dgap}.

For every $(\beta, \theta) \in \mathbb{R}^p \times \mathbb{R}^n$, we have \begin{eqnarray} \delta_{\lambda}(\beta, \theta) = \begin{cases} \frac{\lambda^2}{2}\left\|\theta - (y-X\beta)/\lambda\right\|^2_2 + \lambda\left(\|\beta\|_1 - \theta^TX\beta\right), &\mbox{if } \theta \in \mathcal{P}\\ +\infty, &\mbox{otherwise}. \end{cases} \label{eq:dgap_formula} \end{eqnarray}

Expand the formula in \eqref{eq:dgap}, and then complete the square w.r.t $\theta$.

Let $(\beta, \theta) \in \mathbb{R}^p \times \mathcal{P}$ be a feasible primal-dual pair. If $\theta_{\lambda}^*$ is an optimal dual point (i.e maximizes the dual objective $p_{\lambda}$ defined in \eqref{eq:dual}), then for every other dual point $\theta \in \mathbb{R}^n$ it holds that \begin{eqnarray} \left\|\theta_{\lambda}^* - (y-X\beta)/\lambda\right\|_2 \le \sqrt{2\delta_{\lambda}(\beta, \theta)}/\lambda. \label{eq:dual_sphere} \end{eqnarray}

By the optimality of $\theta_{\lambda}^*$ for the ``marginal'' duality-gap function $\eta \mapsto \delta_{\lambda}(\beta, \eta)$, we have \begin{eqnarray} \delta_{\lambda}(\beta, \theta_{\lambda}^*) \le \delta_{\lambda}(\beta, \theta). \label{eq:dgap_ineq} \end{eqnarray} Now observe that $\theta_{\lambda}^*$, being the projection of the point $y/\lambda$ onto the dual-feasible polyhedron $\mathcal{P}$, lies on the boundary of $\mathcal{P}$ because the latter doesn't contain $y/\lambda$ since $\lambda < \|X^Ty\|_{\infty}$. Thus we have $\|X^T\theta_{\lambda}^*\|_{\infty} = 1$, and by Holder's inequality it follows that \begin{eqnarray} \|\beta\|_1 - \beta^TX^T\theta_{\lambda}^* = \|\beta\|_1\|X^T\theta_{\lambda}^*\|_{\infty} - \beta^TX^T\theta_{\lambda}^* \ge 0. \label{eq:holder} \end{eqnarray} Finally, invoking formula \eqref{eq:dgap_formula} on the LHS of \eqref{eq:dgap_ineq} and using \eqref{eq:holder} completes the proof.

Given a feasible primal-dual pair $(\beta, \theta) \in \mathbb{R}^p \times \mathcal{P}$, the above theorem prescribes a trust-region for the optimal dual point $\theta_{\lambda}^*$, namely the sphere $S_n(y - X\beta)/\lambda, \sqrt{2\delta_{\lambda}(\beta, \theta)}/\lambda)$.

Static safe sphere test

Let's begin with the following elementary but important lemma about the maximum value of the function $\theta \mapsto |b^T\theta|$ on the sphere \begin{eqnarray} S_n(c,r) := \{\theta \in \mathbb{R}^n | \|\theta - c\|_2 \le r\}, \end{eqnarray} of center $c \in \mathbb{R}^n$ and radius $r > 0$. Viz,

\begin{eqnarray} \underset{\theta \in S_n(c, r)}{max}\text{ }|b^T\theta| = |b^Tc| + r\|b\|_2. \label{eq:max_sphere} \end{eqnarray}

One has \begin{eqnarray*} \underset{\theta \in S_n(c, r)}{max}\text{ }b^T\theta = \underset{\theta \in S_n(0, r)}{max}\text{ }b^T(\theta + c) = b^Tc + r\underset{\theta \in S_n(0, 1)}{max}\text{ }b^T\theta = b^Tc + r\|b\|_2. \end{eqnarray*} Replacing $b$ with $-b$ in the above equation yields \begin{eqnarray*} \underset{\theta \in S_n(c, r)}{max}-b^T\theta = -b^Tc + r\|b\|_2. \end{eqnarray*} Now, combining both equations and using the fact that \begin{eqnarray*} |b^T\theta| \equiv max(b^T\theta, -b^T\theta), \end{eqnarray*} we obtain the desired result.

The following lemma is from [Xiang et al. 2014].

Given a safe sphere $S_n(c, r)$ for the optimal dual point $\theta_{\lambda}^*$, the following screening rule is safe \begin{eqnarray} \text{Discard the } j\text{th feature if } |X_j^Tc| + r\|X_j^T\|_2 < 1. \label{eq:sphere_test} \end{eqnarray}

Direct application of \eqref{eq:fundamental} and \eqref{eq:max_sphere}.

Using the trust-region \eqref{eq:dual_sphere} established in the previous theorem for the optimal dual variable $\theta_{\lambda}^*$, namely the sphere $S_n\left((y-X\beta)/\lambda, \sqrt{2\delta_{\lambda}(\beta,\theta)}/\lambda\right)$, we can envisage to device a screening rule in the form \eqref{eq:sphere_test}. Indeed,

For any primal-dual pair $(\beta,\theta) \in \mathbb{R}^p \times \mathbb{R}^n$, the rule \begin{eqnarray} \text{Discard the } j\text{th feature if } \left|X_j^T(y-X\beta)\right| + \sqrt{2\delta_{\lambda}(\beta,\theta)}\|X_j^T\|_2 < \lambda \label{eq:sphere_test_actual} \end{eqnarray} is safe.

Use the last lemma, on the safe sphere $S_n\left((y-X\beta)/\lambda, \sqrt{2\delta_{\lambda}(\beta,\theta)}/\lambda\right)$ obtained in the previous theorem.

Dynamic safe sphere test

The results presented here are due to very recent work by Gramfort and co-workers. The screening rule in \eqref{eq:sphere_test_actual} only makes sense (i.e there is any hope it could ever screen some features) only if $\delta_{\lambda}(\beta,\theta) < +\infty$, i.e only if $\theta$ is dual-feasible. In fact, the smaller the duality-gap $\delta_{\lambda}(\beta,\theta)$, the more effective the screening rule is. Thus we need a procedure which, given a primal point $\beta \in \mathbb{R}^p$, generates a dual-feasible point $\theta$ for which $\delta_{\lambda}(\beta,\theta)$ is as small as possible. As was first mentioned in [Bonnefoy et al. 2014], any iterative solver for \eqref{eq:primal} can be used to produce a sequence of primal-dual feasible pairs $(\beta^{(k)}, \theta^{(k)}) \in \mathbb{R}^p \times \mathcal{P}$, and a decreasing sequence of safe spheres $S_n\left(y / \lambda, \|\theta^{(k)} - y / \lambda\right\|_2)$. Indeed for each primal iterate $\beta^{(k)}$, one finds a scalar $\mu^{(k)} \in \mathbb{R}$ such that $\mu^{(k)} (y - X\beta^{(k)}) \in \mathcal{P}$ is the dual-feasible point closest to $y / \lambda$. This sub-problem is a simple minimization problem of quadratic function $\mu \mapsto \|\mu(y-X\beta^{(k)}) - y / \lambda\|^2_2$ on a closed real interval $\left[-\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}, \frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}\right]$, with an analytic solution \begin{eqnarray} \mu^{(k)} = \begin{cases} \left[\frac{y^T(y-X\beta^{(k)})}{\lambda \|y-X\beta^{(k)}\|_2^2}\right]^{-\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}}_{\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}}, &\mbox{if } X\beta^{(k)} \ne y\\ 1, &\mbox{otherwise}. \end{cases} \end{eqnarray} The resulting algorithm ("Poorman's FISTA with dynamic screening") is depicted below.

Input: $\lambda \in \text{ }]0, \|X^Ty\|_{\infty}[$ the regularization parameter; $\epsilon > 0$ the desired precision on duality gap.\\
Initialize: $\beta^{(0)} \leftarrow 0 \in \mathbb{R}^p$, $\theta^{(0)} \leftarrow y/\lambda \in \mathcal{P}$, $\delta^{(0)} \leftarrow+\infty$, $t^{(0)} \leftarrow 1$, and $k \leftarrow 0$.
Repeat (until $\delta^{(k)} < \epsilon$):
- $\beta^{(k + 1)} \leftarrow soft_{\lambda/L}(\eta^{(k)} - X^T(X\eta^{(k)} - y)), \hspace{.5em}\theta^{(k+1)} \leftarrow \mu^{(k+1)}(y - X\beta^{(k+1)})$
- $t^{(k+1)} \leftarrow \frac{1 + \sqrt{4t^{(k)} + 1}}{2}, \hspace{.5em}\eta^{(k+1)} \leftarrow \beta^{(k+1)} + \frac{t^{(k)} - 1}{t^{(k+1)}}(\beta^{(k+1)} - \beta^{(k)})$
- $\delta^{(k+1)} \leftarrow \frac{\lambda^2}{2}\left\|\theta^{(k+1)} - (y-X\beta^{(k+1)})/\lambda\right\|^2_2 + \lambda\left(\|\beta^{(k+1)}\|_1 - \theta^TX\beta^{(k + 1)}\right)$
- $X, \beta^{(k+1)}, \eta^{(k+1)} \leftarrow screen(X,y,\beta^{(k+1)}, \eta^{(k+1)}, \delta^{(k+1)})$
- $k \leftarrow k + 1$
Return $\beta^{(k)}$

One can show that the above dynamic screening procedure those no harm to the convergence theory the iterative solver.
A coordinate-descent solver would be more appropriate. We presented FISTA here for simplicity rather than applicability

Univariate (heuristic) screening for brain decoding problems

In [Dohmatob et al. PRNI2015], we proposed (amongst other tricks) a univariate heuristic for detecting and disregarding irrelevant voxels in brain decoding problem (ssee figure above). This heuristic can result in upto 10-fold speedup over full-brain analysis. Get a quick overview here.

Details of Nesterov smoothing scheme for Nash equilibria in matrix games

Motivated by a rescent issue on stackexchange, the goal of this post is clarify some of the technical details of Nesterov's smoothing scheme, in the particular case of computing approximate Nash equilibria in matrix games. This manuscript is not novel, but simply strives to help familiarize the reader with Nesterov's smoothing scheme for minimizing non-smooth functions.

Introduction

Let $A \in \mathbb{R}^{m\times n}$, $c \in \mathbb{R}^n$ and $b \in\mathbb{R}^m$, and consider a matrix game with payoff function $(x, u) \mapsto \langle Ax, u\rangle + \langle c, x\rangle + \langle b, u\rangle$. The Nash equilibrium problem is \begin{eqnarray} \underset{x \in \Delta_n}{\text{minimize }}\underset{u \in \Delta_m}{\text{maximize }}\langle Ax, u\rangle + \langle c, x\rangle + \langle b, u\rangle. \end{eqnarray} From the "min" player's point of view, this problem can be re-written as \begin{eqnarray} \underset{x \in \Delta_n}{\text{minimize }}f(x), \end{eqnarray} where the proper convex function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is defined by \begin{eqnarray} f(x) := \hat{f}(x) + \underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u),\hspace{.5em}\hat{f}(x) := \langle c, x\rangle,\hspace{.5em}\hat{\phi}(u) := \langle b, u\rangle. \end{eqnarray} Of course, the dual problem is \begin{eqnarray} \underset{u \in \Delta_m}{\text{maximize }}\phi(u), \end{eqnarray} where the proper concave function $\phi: \mathbb{R}^m \rightarrow \mathbb{R}$ is defined by \begin{eqnarray} \phi(u) := -\hat{\phi}(u) + \underset{x \in \Delta_n}{\text{min }}\langle Ax, u\rangle + \hat{f}(x). \end{eqnarray} Now, for a positive scalar $\mu$, consider a smoothed version of $f$ \begin{eqnarray} \bar{f}_\mu(x) := \hat{f}(x) + \underbrace{\underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u) - \mu d_2(u)}_{f_\mu(x)}, \end{eqnarray}

where $d_2$ is a prox-function for the simplex $\Delta_m$, with prox-center $\bar{u} \in \Delta_m$. For simplicity, take $d_2(u) := \frac{1}{2}\|u - \bar{u}\|^2 \ge 0 = d_2(\bar{u}), \forall u \in \Delta_m$ and $\bar{u} := (1/m,1/m,...,1/m) \in \Delta_m$. Define a prox-function $d_1$ for $\Delta_n$ analogously. Note that $\sigma_1 = \sigma_2 = 1$ since the $d_j$'s are $1$-strongly convex. Also, noting that the $d_j$'s attain their maximum at the kinks of the respective simplexes, one computes where $d_2$ is a prox-function (see Nesterov's paper for definition) for the $D_1 := \underset{x \in \Delta_n}{\text{max }}d_1(x) = \frac{1}{2} \times$ squared distance between any kink and $\bar{x} = \frac{1}{2}\left((1-1/n)^2 + (n-1)(1/n)^2\right) = (1 - 1 / n) / 2$. Similarly $D_2 := \underset{u \in \Delta_m}{\text{max }}d_2(u) = (1 - 1 / m) / 2$. Note that the functions $(f_\mu)_{\mu > 0}$ increase point-wise to $f$ in the limit $\mu \rightarrow 0^+$.

Ingredients for implementing the algorithm

Let us now prepare the ingredients necessary for implementing the algorithm. Definition. Given a closed convex subset $C \subseteq \mathbb{R}^m$, and a point $x \in \mathbb{R}^m$, recall the definition of the orthogonal projection of $x$ unto $C$ \begin{eqnarray} proj_C(x) := \underset{z \in C}{\text{argmin }}\frac{1}{2}\|z-x\|^2. \end{eqnarray} Geometrically, $proj_C(x)$ is the unique point of $C$ which is closest to $x$, and so $proj_C$ is a well-defined function from $\mathbb{R}^m$ to $C$. For example, if $C$ is the nonnegative $m$th-orthant $\mathbb{R}_+^m$, then $proj_C(x) \equiv (x)_+$, the component-wise maximum of $x$ and $0$.
Step 1: Call to the oracle, or computation of $f(x_k)$ and $\nabla f(x_k)$. For any $x \in \mathbb{R}^n$, define $v_\mu(x) := (Ax - b) / \mu$ and $u_\mu(x) := proj_{\Delta_m}(\bar{u} + v_\mu(x))$. One computes \begin{eqnarray*} \begin{split} f_\mu(x) := \underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u) - \mu d_2(u) &= -\underset{u \in \Delta_m}{\text{min }}\frac{1}{2}\mu\|u-\bar{u}\|^2 + \langle b - Ax, u\rangle\\ &= \frac{1}{2}\mu\|v_\mu(x)\|^2 - \underset{u \in \Delta_m}{\text{min }}\frac{1}{2}\mu\|u - (\bar{u} + v_\mu(x))\|^2, \end{split} \end{eqnarray*} by completing the square in $u$ and ignoring constant terms. One recognizes the minimization problem on the right-hand side as the orthogonal projection of the point $\bar{u} + v_\mu(x)$ unto the simplex $\Delta_m$. It is not difficult to show that $f_\mu$ is smooth with gradient $\nabla f_\mu: x \mapsto A^Tu_\mu(x)$ (one can invoke Danskin's theorem, for example) and $L_\mu: = \frac{1}{\mu\sigma_2}\|A\|^2$ is a Lipschitz constant for the latter. Thus, $\bar{f}_\mu$ is smooth and one computes \begin{eqnarray} \nabla \bar{f}_\mu(x_k) = \nabla \hat{f}(x_k) + \nabla f_\mu(x_k) = c + A^*u_\mu(x_k). \end{eqnarray} Moreover, $L_\mu$ defined above is also a Lipschitz constant for $\nabla \bar{f}_\mu$.
Step 3: Computation of $z_k$. For any $s \in \mathbb{R}^n$, one has \begin{eqnarray*} \underset{x \in \Delta_n}{\text{argmin }}Ld_1(x) + \langle s, x\rangle = \underset{x \in \Delta_n}{\text{argmin }}\frac{1}{2}L_\mu\|x-\bar{x}\|^2 + \langle s, x\rangle = proj_{\Delta_n}\left(\bar{x} - \frac{1}{L_\mu}s\right), \end{eqnarray*} by completing the square in $x$ and ignoring constant terms. Thus, with $s = \sum_{i=0}^k\frac{i+1}{2}\nabla \bar{f}_\mu(x_i)$, we obtain the update rule \begin{eqnarray} z_k = proj_{\Delta_n}\left(\bar{x} - \frac{1}{L_\mu}\sum_{i=0}^k\frac{i+1}{2}\nabla \bar{f}_\mu(x_i)\right). \end{eqnarray}
Step 2: Computation of $y_k$. Similarly to the previous computation, one obtains the update rule \begin{eqnarray} y_k = T_{\Delta_n}(x_k) := \underset{y \in \Delta_n}{\text{argmin }}\langle \nabla \bar{f}_\mu(x), y\rangle + \frac{1}{2}L_\mu\|y-x\|^2 = proj_{\Delta_n}\left(x_k - \frac{1}{L_\mu}\nabla \bar{f}_\mu (x_k)\right). \end{eqnarray} Stopping criterion. The algorithm is stopped once the primal-dual gap \begin{eqnarray} \begin{split} &f(x_k) - \phi(u_\mu(x_k))=\\ &\left(\langle c, x_k \rangle + \underset{u \in \Delta_m}{\text{max }}\langle Ax_k, u\rangle - \langle b, u\rangle\right) - \left(-\langle b, u_\mu(x_k)\rangle + \underset{z \in \Delta_n}{\text{min }}\langle Az, u_\mu(x_k)\rangle + \langle c, z\rangle\right)\\ &= \langle c, x_k \rangle + \underset{1 \le j \le m}{\text{max }}(Ax_k-b)_j + \langle b, u_\mu(x_k)\rangle - \underset{1 \le i \le n}{\text{min }}(A^Tu_\mu(x_k) + c)_i \end{split} \end{eqnarray} is below a tolerance $\epsilon > 0$ (say $\epsilon \sim 10^{-4}$). Note that in the above computation, we have used the fact that for any $r \in \mathbb{R}^m$, one has \begin{eqnarray} \begin{split} \underset{u \in \Delta_m}{\text{max }}\langle r, u\rangle &= \text{max of }\langle r, .\rangle\text{ on the boundary of }\Delta_m \\ &= \text{max of }\langle r, .\rangle\text{ on the line segment joining any two distinct kinks of } \Delta_m \\ &= \underset{1 \le i < j \le m}{\text{max }}\text{ }\underset{0 \le t \le 1}{\text{max }}tr_i + (1-t)r_j = \underset{1 \le i \le m}{\text{max }}r_i. \end{split} \end{eqnarray}
Algorithm parameters. In accord with Nesterov's recommendation, we may take \begin{eqnarray} \mu = \frac{\epsilon}{2D_2} = \frac{\epsilon}{2}, \text{ and }L_\mu=\frac{\|A\|^2}{\sigma_2 \mu} = \frac{\|A\|^2}{\mu}. \end{eqnarray}

Limitations and extensions

There are a number of conceptual problems with this smoothing scheme. For example, how do we select the smoothing parameter $\mu$? If it's too small, then the scheme becomes ill-conditioned (Lipschitz constants explode, etc.). If it's too large, then the approximation is very bad. A possible workaround is to start with a reasonably big value of $\mu$ and decrease it gradually across iterations (as was done in Algorithm 1 of this paper, for example). In fact, Nesterov himself has proposed the "Excessive Gap Technique" (google it) as a more principled way to go about smoothing. EGT has been used by Andrew Gilpin as an alternative way to go about solving imcomplete information two-person zero-sum sequential games (In such games, the player's strategy profiles are no longer simplexes but more general polyhera.), the player's like Texas Hold'em.