Monday, December 28, 2015

$\frac{1}{2}\|.\|^2$ is the only self-dual function on a Hilbert space $X$!

Let $X$ be a Hilbert space, $x \mapsto \|x\| := \sqrt{x^Tx}$ be the Euclidean norm on $X$, and $f:X \rightarrow (-\infty,+\infty]$ an extended real-valued function. Define the convex conjugate of $f$, denoted $f^*$, by
$$f^*(y) := \sup_{x \in X}x^Ty - f(x), \; \forall y \in X.$$
Note that $f^*$ is always convex and l.s.c. (being a supremum of affine functions), without any assumptions whatsoever on $f$.

Question: When do we have $f^* = f$? Check out the answer here.
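For the easy half of the answer, here is the standard computation showing that $f := \frac{1}{2}\|.\|^2$ is indeed self-dual. For fixed $y$, the function $x \mapsto x^Ty - \frac{1}{2}\|x\|^2$ is concave and differentiable, with gradient $y - x$; the supremum is therefore attained at $x = y$, whence
$$f^*(y) = y^Ty - \frac{1}{2}\|y\|^2 = \frac{1}{2}\|y\|^2 = f(y), \; \forall y \in X.$$
The converse (that no other function has this property) is the interesting part; see the link above.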

Monday, December 21, 2015

Paper accepted at ICASSP 2016!

Our math paper entitled "LOCAL Q-LINEAR CONVERGENCE AND FINITE-TIME ACTIVE SET IDENTIFICATION OF ADMM ON A CLASS OF PENALIZED REGRESSION PROBLEMS" has been accepted at ICASSP 2016, the signal-processing conference (the largest in the world). The conference will be held at the SHICC (Shanghai International Convention Center).

Author manuscript available upon request.

Exact minimization of the linearly perturbed distance of a point to a closed convex set

Let $a, b \in \mathbb R^n$, and $C$ be a nonempty closed convex subset of $\mathbb{R}^n$ with $a \not \in C$. Consider the problem $$\text{minimize }\|x-a\| + b^Tx\text{ subject to } x \in C.$$ The case $b = 0$ corresponds to the well-known problem of projecting the point $a$ onto $C$. This problem can be easy or difficult, depending on the geometry of $C$ (for example, it is rather easy if $C$ is a diamond-shaped polyhedron, or an oval). However, the perturbed case $b \ne 0$ is completely non-trivial and must be treated rather carefully. That notwithstanding, we show here that the case $\|b\| < 1$ is essentially equivalent to the non-perturbed case $b = 0$.
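To make the problem concrete, here is a minimal numerical sketch (my own illustration, not taken from the note linked above): a projected-subgradient method for $\text{minimize }\|x-a\| + b^Tx\text{ subject to } x \in C$, with $C$ taken to be a Euclidean ball so that the projection onto $C$ is trivial. With $b = 0$ the iterates should simply recover the projection of $a$ onto $C$.

```python
import numpy as np

def proj_ball(x, center, radius):
    """Euclidean projection onto the ball C = B(center, radius)."""
    d = x - center
    nd = np.linalg.norm(d)
    return x if nd <= radius else center + (radius / nd) * d

def perturbed_projection(a, b, center, radius, n_iter=50_000):
    """Projected subgradient method for min_{x in C} ||x - a|| + <b, x>.
    Slow but simple; for illustration only."""
    x = center.copy()
    best_x, best_val = x, np.inf
    for k in range(n_iter):
        r = x - a
        nr = np.linalg.norm(r)
        g = b + (r / nr if nr > 1e-12 else 0.0)   # a subgradient of the objective
        x = proj_ball(x - g / np.sqrt(k + 1.0), center, radius)
        val = np.linalg.norm(x - a) + b @ x
        if val < best_val:
            best_x, best_val = x, val
    return best_x

rng = np.random.default_rng(0)
a = rng.normal(size=5) + 3.0                       # a point well outside C
center, radius = np.zeros(5), 1.0
x0 = perturbed_projection(a, np.zeros(5), center, radius)      # b = 0 case
print(np.linalg.norm(x0 - proj_ball(a, center, radius)))       # small: recovers proj_C(a)
b = rng.normal(size=5); b *= 0.5 / np.linalg.norm(b)           # ||b|| = 0.5 < 1
x_b = perturbed_projection(a, b, center, radius)               # perturbed case
```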

Saturday, November 21, 2015

Graph-Net vs. TV-L1: structure in regression coefficients

See conference paper for more theory on these multi-variate decoding / recovery models in brain science. Implementation to appear in next release of nilearn.

Monday, October 5, 2015

On the unreasonable effectiveness of Friedrichs angles: rates of convergence of powers of some weird operators

Statement of the problem

Let $d$ and $c$ be positive integers and $q = dc$. Let $G$ be a $q$-by-$q$ positive semi-definite real matrix with eigenvalues all $\le 1$, and define the $q$-by-$2q$ matrix $A = [G\hspace{1em}\mathrm{I}_q - G]$, where $\text{I}_q$ is the $q$-by-$q$ identity matrix. Now, given a scalar $\kappa > 0$ and $d$ vectors $X_1, X_2, \ldots, X_d \in \mathbb{R}^c$, with $\|X_j\| \ne \kappa \; \forall j$, consider the $q$-by-$q$ block-diagonal matrix $D$ whose $j$th block, a $c$-by-$c$ matrix, is given by \begin{equation} D_j = \begin{cases}\text{I}_c - \frac{\kappa}{\|X_j\|}\text{proj}_{\langle X_j \rangle^\perp}, &\mbox{ if } \|X_j\| > \kappa,\\ 0, &\mbox{ otherwise,}\end{cases} \end{equation} where $\text{proj}_{\langle X_j \rangle^\perp} = \mathrm{I}_c - \mathrm{proj}_{\langle X_j\rangle} = \mathrm{I}_c - \frac{1}{\|X_j\|^2}X_jX_j^T$ is the projection onto the orthogonal complement of the $1$-dimensional subspace of $\mathbb{R}^c$ spanned by the vector $X_j$. Finally define the $2q$-by-$q$ matrix $B = [D\hspace{1em}\mathrm{I}_q-D]^T$. For example, if $c=1$, then each $D_j \in \{{0,1}\}$, a bit indicating whether $|X_j| > \kappa$, and $D$ is a diagonal matrix with $0$s and $1$s accordingly. Now define the $2q$-by-$2q$ matrix $F := BA$ (Or $F := AB$. To see this equivalence, note that $(BA)^{n+1} = B(AB)^nA$, for all $n \in \mathbb{N}$. Thus to study the "convergence" of $((BA)^n)_{n \in \mathbb{N}}$, it suffices to study the convergence of $((AB)^n)_{n \in \mathbb{N}}$. In fact, if $(AB)^n \rightarrow C$, then $(BA)^n \rightarrow BCA$.).
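Here is a small NumPy sketch (my own, just to fix notation) that builds $G$, $D$, $A$ and $B$ from random data and checks the identity $(BA)^{n+1} = B(AB)^nA$ numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, kappa = 4, 3, 1.0
q = d * c

# A random symmetric PSD matrix G with eigenvalues in [0, 1].
Q, _ = np.linalg.qr(rng.normal(size=(q, q)))
G = Q @ np.diag(rng.uniform(0, 1, size=q)) @ Q.T

# The block-diagonal matrix D built from the vectors X_1, ..., X_d in R^c.
X = rng.normal(size=(d, c))
D = np.zeros((q, q))
for j in range(d):
    Xj = X[j]
    nXj = np.linalg.norm(Xj)
    if nXj > kappa:
        proj_perp = np.eye(c) - np.outer(Xj, Xj) / nXj ** 2
        D[j * c:(j + 1) * c, j * c:(j + 1) * c] = np.eye(c) - (kappa / nXj) * proj_perp
    # otherwise the j-th block stays 0

Iq = np.eye(q)
A = np.hstack([G, Iq - G])                      # q-by-2q
B = np.vstack([D.T, (Iq - D).T])                # 2q-by-q, i.e. [D  I-D]^T

n = 5
lhs = np.linalg.matrix_power(B @ A, n + 1)      # (BA)^{n+1}
rhs = B @ np.linalg.matrix_power(A @ B, n) @ A  # B (AB)^n A
print(np.allclose(lhs, rhs))                    # True: studying (AB)^n is enough
```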

Ultimately, I'm interested in the rate of convergence (in the sense of Definition 2.1 of this paper) of the sequence of matrix powers $(F^n)_{n\in\mathbb{N}}$. I can easily bound the spectral radius $r(F) \le \|F\| \le \|A\| \le 1$, but this turns out to be not very useful in my situation (e.g. $\|A\|$ can be $1$). If I can compute the eigenvalues of $F$ (e.g. in terms of the eigenvalues of $G$), then there is a fair chance I can apply the results of this paper to get rates of convergence for the aforementioned sequence.

Question

Can one relate (via a formula, for example) the eigenvalues of $F$ to the eigenvalues of $G$? How fast does iterating $AB$ or $BA$ converge?

Partial solution for limiting case when $G$ is the projection onto a subspace

In my initial problem, $G$ is a function of a scalar parameter, and after careful inspection, it turns out that in the limit as this parameter goes to infinity, $G$ becomes the projection operator onto a subspace of $\mathbb{R}^q$, i.e. $G = \mathrm{proj}_{\text{Im }K} = K(K^TK)^{-1}K^T$ for some full column-rank $q$-by-$p$ matrix $K$ (with $p \le q$). If in addition $c=1$, then $D$ too is a projection (precisely, it is a diagonal matrix with only $0$s and $1$s, as already explained further above) and one computes \begin{eqnarray} \begin{split} F &= AB = [G\hspace{1em}\mathrm{I}_q - G][D\hspace{1em}\mathrm{I}_q - D]^T = GD + (\mathrm{I}_q - G)(\mathrm{I}_q - D)\\ &= \mathrm{proj}_U\mathrm{proj}_V + \mathrm{proj}_{U^\perp}\mathrm{proj}_{V^\perp}, \end{split} \end{eqnarray} where $U := \text{Im }K$, and $V := \text{Im }D$ (note that ${U^\perp} = \text{ker }K^T$ and ${V^\perp} = \text{ker }D^T$). It then suffices to invoke Theorem 3.10 of this paper, with $\mu = 1 \in [0, 2)$, to obtain that the sequence of matrix powers $(F^n)_{n \in \mathbb{N}}$ has the optimal rate of convergence $\gamma = \cos{\theta_F}(U, V) \in [0, 1)$, the cosine of the *Friedrichs angle* $\theta_F(U, V) \in [0,\pi/2]$ between $U$ and $V$, defined by: \begin{equation} \cos{\theta_F}(U, V) := \sup\{\langle u, v \rangle | u \in U \cap (U \cap V)^\perp, \|u\| \le 1, v \in V \cap (U \cap V)^\perp, \|v\| \le 1 \}. \end{equation}
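Below is a small NumPy sketch (my own, just for intuition, not from any of the papers cited above) that computes $\cos\theta_F(U, V)$ from orthonormal bases of $U$ and $V$ via principal angles: the cosines of the principal angles are the singular values of $Q_U^TQ_V$, and the Friedrichs angle corresponds to the largest singular value strictly smaller than $1$ (the singular values equal to $1$ account for $U \cap V$).

```python
import numpy as np

def friedrichs_cos(QU, QV, tol=1e-10):
    """Cosine of the Friedrichs angle between ran(QU) and ran(QV), where QU and QV
    have orthonormal columns: the largest principal-angle cosine strictly below 1."""
    s = np.linalg.svd(QU.T @ QV, compute_uv=False)
    s = s[s < 1.0 - tol]          # singular values equal to 1 come from U ∩ V
    return s.max() if s.size else 0.0

rng = np.random.default_rng(0)
q = 20
QU, _ = np.linalg.qr(rng.normal(size=(q, 6)))    # U = Im(QU), dim 6
QV, _ = np.linalg.qr(rng.normal(size=(q, 9)))    # V = Im(QV), dim 9
gamma = friedrichs_cos(QU, QV)

# Empirical sanity check on F = P_U P_V + P_{U^perp} P_{V^perp}:
PU, PV = QU @ QU.T, QV @ QV.T
Iq = np.eye(q)
F = PU @ PV + (Iq - PU) @ (Iq - PV)
F_inf = np.linalg.matrix_power(F, 5000)          # numerical limit of F^n
e = [np.linalg.norm(np.linalg.matrix_power(F, n) - F_inf) for n in (40, 41)]
print(gamma, e[1] / e[0])   # empirical contraction factor, to compare with the claimed rate gamma
```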

Final words

As a pathological example, if $K$ (resp. $D$) is invertible, then $F = \mathrm{proj}_V$ (resp. $F = \mathrm{proj}_U$), and so $\cos{\theta_F}(U, V) = 0$. Consequently, $(F^n)_{n \in \mathbb{N}}$ converges exponentially fast (in fact, in $1$ iteration!) to $F$.

Tuesday, September 15, 2015

I conjecture that there are infinitely many linear relations $p_n + p_{n + 3} = 2 p_{n + 2}$ in the sequence of primes!

For the amusement of the trade alone, and without worrying about real-life applications, in the spirit of G. H. Hardy, who invites us to do math as a fine art, I'm interested in constructing "random walks" on the prime numbers with certain "ergodic" properties. I've reduced my troubles to the following conjecture, which is simple to understand (not to solve!). Viz,

 Let $p_1 < p_2 < p_3 < \ldots < p_n < \ldots$ be the sequence of primes (with $p_1 := 2$ as usual).

Conjecture: There are infinitely many positive integers $n$ for which
\begin{eqnarray}
p_n + p_{n + 3} - 2 p_{n + 2} = 0.
\end{eqnarray}

For example, the above relation holds for $n = 2$ since $p_2 + p_5 - 2p_4 = 3 + 11 - 2 \times 7 = 0$. Follow the technical thread here.
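As a quick empirical sanity check (a small sieve-based script of my own, which of course proves nothing), such $n$ do keep showing up:

```python
import numpy as np

def primes_up_to(N):
    """Simple sieve of Eratosthenes returning all primes <= N."""
    is_prime = np.ones(N + 1, dtype=bool)
    is_prime[:2] = False
    for i in range(2, int(N ** 0.5) + 1):
        if is_prime[i]:
            is_prime[i * i::i] = False
    return np.flatnonzero(is_prime)

p = primes_up_to(1_000_000)
# 1-based indices n with p_n + p_{n+3} - 2 p_{n+2} = 0
hits = [n + 1 for n in range(len(p) - 3) if p[n] + p[n + 3] - 2 * p[n + 2] == 0]
print(len(hits), hits[:10])   # many hits; the list starts with n = 2
```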

Tuesday, August 25, 2015

What is a convex conjugate?

In case you've missed the convex-optimization revolution (which started with Moreau, Rockafellar, etc., but has a pre-history dating back at least to Kuhn, Karush, Tucker, Nash, etc.), the hottest things around are subdifferentials, convex conjugates (aka Fenchel-Legendre transforms), prox operators, and infimal convolutions. Check out this beautiful one-page manuscript on the subject, by H. H. Bauschke. There is a lot of cool stuff associated with this duality result; here are a few items:
  • All primal solutions are obtainable from any dual solution! Paramount is the Fenchel duality Theorem (see previous reference), a monster result for mapping between an arbitrary dual solution and the entire set of primal solutions. You can see it in action here.
  • Duality between strong convexity and strong smoothness. Another paper shows a non-trivial and very useful result, Theorem 6 (it's quite possible this result was known earlier, but that's another debate...): a closed convex function $R$ is $\beta$-strongly convex w.r.t. a norm $\|.\|$ iff its Fenchel-Legendre transform (aka convex conjugate) $R^*$ is $\frac{1}{\beta}$-strongly smooth w.r.t. the dual norm $\|.\|_*$. This result can be invoked to establish the strong convexity of some otherwise rather complicated matrix functions. A worked example is given just below this list.
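Worked example (a standard computation, added here just for concreteness). Take $R := \frac{\beta}{2}\|.\|_2^2$, which is $\beta$-strongly convex w.r.t. $\|.\|_2$. Its conjugate is
$$R^*(y) = \sup_{x}\langle x, y\rangle - \frac{\beta}{2}\|x\|_2^2 = \frac{1}{2\beta}\|y\|_2^2,$$
the supremum being attained at $x = y/\beta$. The gradient of $R^*$ is $y \mapsto y/\beta$, which is $\frac{1}{\beta}$-Lipschitz, so $R^*$ is indeed $\frac{1}{\beta}$-strongly smooth w.r.t. $\|.\|_2$ (which is its own dual norm).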



Wednesday, July 29, 2015

SpaceNet now available in nilearn master branch + will be available in upcoming release

Hurray! SpaceNet, a collection of "sparsity + structure" inducing models for brain decoding, is now available in the master branch of nilearn and will be part of the upcoming nilearn release. For more information on the model, see my OHBM presentation and the many references therein.

New paper: computing Nash equilibria in two-person zero-sum games with incomplete information (e.g Texas Hold'em).

In this paper, I'm using some funky ideas from modern (post Jean-Jacques Moreau era) convex analysis to attack the Nash equilibrium problem for these complex games. Read the full paper on arXiv.

Tuesday, June 30, 2015

Exponentially fast computation of Nash equilibria in matrix games: a weird inequality for polyhedral convex functions


Introduction: General considerations

(Figure: example of a bounded polyhedron, aka polytope.)
Definition. A proper convex function (the qualifier "proper" means that the function is not identically equal to $+\infty$ and never takes the value $-\infty$) $g: \mathbb{R}^p \rightarrow [-\infty, +\infty]$ is said to be polyhedral if its epigraph $\mathrm{epi}(g) := \{(x, t) \in \mathbb{R}^{p+1} | t \ge g(x)\}$ is a (convex) polyhedron.

We have the following useful characterization (see [section 2, Murota et al. 2013]): $g$ is polyhedral iff there exist a matrix $A = (a_i) \in \mathbb{R}^{m \times p}$ and a vector $c = (c_i) \in \mathbb{R}^m$ s.t. $g(x) = \text{max}\{\langle a_i, x \rangle + c_i| 1 \le i \le m\}$,  $\forall x \in dom(g)$. As usual, the set $dom(g) := \{x \in \mathbb{R}^p | g(x) < +\infty\}$ is the effective domain of $g$, and is non-empty since $g$ is proper, by assumption. Given a compact convex subset $C \subseteq dom(g)$, let $\mathrm{Opt}_C(g) \subseteq C$ be the set of minimizers of $g$ on $C$. When there is no confusion about the set $C$, we will omit the subscript and simply write $\mathrm{Opt}(g)$. Note that $\mathrm{Opt}_C(g)$ is itself convex.
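For instance (a standard example), the absolute value on $\mathbb{R}$ is polyhedral, with $m = 2$ pieces: $|x| = \text{max}\{x, -x\}$. More generally, $\|x\|_{\infty} = \text{max}\{\pm x_i | 1 \le i \le p\}$ is polyhedral with $m = 2p$ pieces.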


Minimax for matrix games and the weird inequality

It's not hard to see that the Nash equilibrium problem for a matrix game with payoff matrix $A \in \mathbb{R}^{m \times n}$
\begin{eqnarray}
\underset{x \in \Delta_m}{\text{minimize }}\underset{y \in \Delta_n}{\text{maximize }}\langle Ax, y\rangle
\end{eqnarray}
can be written as
\begin{eqnarray}
\underset{(x,y) \in \Delta_m \times \Delta_n}{\text{minimize }}G(x, y),
\end{eqnarray}
where
\begin{eqnarray}
\begin{split}
G(x, y) & := \underset{v \in \Delta_n}{\text{max }}\langle Ax, v\rangle - \underset{u \in \Delta_m}{\text{min }}\langle Au, y\rangle = \text{max}\{\langle Ax, v\rangle - \langle Au, y\rangle | (u,v) \in \Delta_m \times \Delta_n\} \\
&= \text{max}\{\langle r_i, [x\text{ }y]^T \rangle | 1 \le i \le m + n\}
\end{split}
\end{eqnarray} is the primal-dual gap function and $(r_i) = R := \begin{bmatrix}0 & A\\-A^T & 0\end{bmatrix} \in \mathbb{R}^{(m+n)\times(m+n)}$. Thus $G$ is a polyhedral convex function on $\Delta_m \times \Delta_n$. Note that $G(x, y) \ge 0$ for every pair of strategies $(x, y) \in \Delta_m \times \Delta_n$, with equality iff $(x,y) \in \mathrm{Opt}(G)$. Also note that the elements of $\mathrm{Opt}(G)$ are simply the Nash equilibria of the game.

Definition. For a given tolerance $\epsilon > 0$, we call a pair of strategies $(x^*,y^*) \in \Delta_m \times \Delta_n$ a Nash $\epsilon$-equilibrium if $G(x^*, y^*) \le \epsilon$.
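To make the gap function concrete, here is a tiny NumPy sketch (my own dimensionally explicit conventions: the payoff matrix $A$ is $m$-by-$n$, the min player plays $x \in \Delta_n$, the max player plays $y \in \Delta_m$), evaluated on rock-paper-scissors, whose unique equilibrium is uniform play:

```python
import numpy as np

def gap(A, x, y):
    """Primal-dual gap G(x, y) = max_{u in Delta_m} <Ax, u> - min_{v in Delta_n} <Av, y>,
    where x is the min player's mixed strategy and y the max player's."""
    return np.max(A @ x) - np.min(A.T @ y)

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # rock-paper-scissors
u = np.ones(3) / 3
print(gap(A, u, u))                       # 0.0: uniform play is a Nash equilibrium
print(gap(A, np.array([1., 0, 0]), u))    # > 0 for a non-equilibrium pair (pure "rock")
```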

A remarkable result by [Andrew Gilpin et al.] (Lemma 3) says that for any matrix game, there exists a scalar $\delta > 0$ such that
 \begin{eqnarray} \mathrm{dist}((x, y), \mathrm{Opt}(G)) \le \dfrac{G(x, y)}{\delta}, \forall (x, y) \in \Delta_m \times \Delta_n, \label{eq:weirdo}
\end{eqnarray}

where $\mathrm{dist}((x, y), \mathrm{Opt}(G)) := \mathrm{min}\{\|(u,v) - (x,y)\|| (u, v) \in \mathrm{Opt}(G)\}$ is the distance of $(x, y)$ from $\mathrm{Opt}(G)$. Using this bound in conjunction with Nesterov smoothing, the authors derived an $\mathcal{O}\left(\left(\|A\| / \delta\right) \ln\left(1 / \epsilon\right)\right)$ algorithm for computing a Nash $\epsilon$-equilibrium. Note, however, that the supremum of the $\delta$'s verifying \eqref{eq:weirdo} can be arbitrarily small (in fact, one can explicitly construct games for which it is arbitrarily close to $0$), and so the exponential speed may not be observed empirically on some ill-conditioned games. To be continued...

Monday, June 29, 2015

PRNI & OHBM 2015

PRNI 2015 @ Stanford was great! Great keynotes (Candès, Tibshirani, etc.) and a great venue. OHBM in Honolulu, well, the idea alone seems surreal. Hawaii is marvelous. Great conference too.

These are the slides for my talks.

Here are some photos.

The future of screening

(Figure: feature-screening à la [Dohmatob et al. PRNI2015].)
In machine learning, feature screening aims at detecting and eliminating irrelevant (non-predictive) features, thus reducing the size of the underlying optimization problem (here, problem \eqref{eq:primal}). The general idea is to compute, for each value of the regularization parameter, a relevance measure for each feature, which is then compared with a threshold (produced by the screening procedure itself). Features which fall short of this threshold are detected as irrelevant and eliminated. This post presents an overview of so-called dynamic screening rules, a new generation of safe screening rules for the Lasso (and related models like the Group Lasso) which have appeared recently in the literature (for example, [Bonnefoy et al. 2014]). A particular emphasis is put on the novel duality-gap-based screening rules due to Gramfort and co-authors.


We also present recent work on heuristic univariate screening [Dohmatob et al. PRNI2015].

Notation

For a vector $v \in \mathbb{R}^p$, we recall the definition of its
  • $l_1$-norm $\|v\|_1 := \sum_{1 \le j \le p}|v_j|$,
  • $l_2$-norm $\|v\|_2 := (\sum_{1 \le j \le p}|v_j|^2)^{1/2}$, and
  • $l_{\infty}$-norm $\|v\|_{\infty} := \max_{1 \le j \le p}|v_j|$.
The transpose of a matrix $A \in \mathbb{R}^{n \times p}$ is denoted $A^T$. The $i$th row of $A$ is denoted $A_i$. The $j$th column of $A$ is the $j$th row of $A^T$, namely $A^T_j$. Finally, define the bracket \begin{equation} \left[a\right]_c^b := min(max(a, b), c), \end{equation} i.e. the value of $a$ clipped to the interval $[b, c]$.

General considerations

Consider a Lasso model with response vector $y \in \mathbb{R}^n$ and design matrix $X \in \mathbb{R}^{n \times p}$. Let $\lambda$ (with $0 < \lambda < \|X^Ty\|_{\infty}$) be the regularization parameter. The primal objective to be minimized as a function of the primal variable $\beta \in \mathbb{R}^p$ ($\beta$ is the vector of regression coefficients) is \begin{eqnarray} p_{\lambda}(\beta) := \frac{1}{2}\|X\beta - y\|^2_2 + \lambda\|\beta\|_1. \label{eq:primal} \end{eqnarray} In the sense of Fenchel-Rockafellar, the dual objective to be maximized as a function of the dual variable $\theta \in \mathbb{R}^n$ is \begin{eqnarray} d_{\lambda}(\theta) := \begin{cases} \frac{1}{2}\|y\|^2_2 - \frac{\lambda^2}{2}\|\theta - \frac{y}{\lambda}\|^2_2, &\mbox{if } \|X^T\theta\|_{\infty} \le 1\\ -\infty, &\mbox{otherwise}. \end{cases} \label{eq:dual} \end{eqnarray} Finally, the duality-gap $\delta_{\lambda}(\beta, \theta)$ at $(\beta, \theta)$ is defined by \begin{eqnarray} \delta_{\lambda}(\beta, \theta) := p_{\lambda}(\beta) - d_{\lambda}(\theta). \label{eq:dgap} \end{eqnarray} One notes the following straightforward facts:
  • $\delta_{\lambda}(\beta, \theta) \ge 0$ with equality iff $\beta$ minimizes \eqref{eq:primal} and $\theta$ maximizes \eqref{eq:dual}. Such a primal-dual pair is called an optimal pair.


  • The dual objective $d_{\lambda}$ defined in \eqref{eq:dual} has a unique maximizer $\theta_{\lambda}^*$, which corresponds to the Euclidean projection of $y/\lambda$ onto the dual-feasible polyhedron \begin{eqnarray} \mathcal{P} := \{\theta \in \mathbb{R}^n | \|X^T\theta\|_{\infty} \le 1\}. \end{eqnarray} Note that $\mathcal{P}$ is closed and convex (and compact whenever $X$ has full row rank).
  • Given an optimal primal-dual pair $(\beta^*_{\lambda}, \theta^*_{\lambda}) \in \mathbb{R}^p \times \mathcal{P}$, we have the fundamental safe screening rule (safe meaning that it provably cannot mistakenly discard active features) \begin{eqnarray} |X^T_j\theta^*_{\lambda}| < 1 \implies \beta_{\lambda,j}^* = 0\text{ (i.e. the $j$th feature is irrelevant)}. \label{eq:fundamental} \end{eqnarray}
  • The inequality \eqref{eq:fundamental} opens up the possibility of constructing safe screening rules as follows:
    • Construct a ``small'' compact set $C \subseteq \mathbb{R}^n$, containing the dual optimal point $\theta_{\lambda}^*$ and with $C \cap \mathcal{P} \ne \emptyset$, such that the maximum value $m_{C,j} := \underset{\theta \in C}{max}\text{ }|X^T_j\theta|$ can be easily computed.
    • Noting that $m_{C,j}$ is an upper bound for $|X^T_j\theta^*_{\lambda}|$ in \eqref{eq:fundamental}, we may then discard all features $j$ for which $m_{C,j} < 1$.
    The rest of this manuscript overviews methods for effectively realizing such a construction.

    Safe sphere tests

    We start with the following lemma which provides a useful formula for the duality-gap $\delta_{\lambda}(\beta, \theta)$ defined in \eqref{eq:dgap}.
    Lemma. For every $(\beta, \theta) \in \mathbb{R}^p \times \mathbb{R}^n$, we have \begin{eqnarray} \delta_{\lambda}(\beta, \theta) = \begin{cases} \frac{\lambda^2}{2}\left\|\theta - (y-X\beta)/\lambda\right\|^2_2 + \lambda\left(\|\beta\|_1 - \theta^TX\beta\right), &\mbox{if } \theta \in \mathcal{P}\\ +\infty, &\mbox{otherwise}. \end{cases} \label{eq:dgap_formula} \end{eqnarray}
    Proof: Expand the formula in \eqref{eq:dgap}, and then complete the square w.r.t. $\theta$.
    Theorem. Let $(\beta, \theta) \in \mathbb{R}^p \times \mathcal{P}$ be a feasible primal-dual pair. If $\theta_{\lambda}^*$ is the optimal dual point (i.e. it maximizes the dual objective $d_{\lambda}$ defined in \eqref{eq:dual}), then \begin{eqnarray} \left\|\theta_{\lambda}^* - (y-X\beta)/\lambda\right\|_2 \le \sqrt{2\delta_{\lambda}(\beta, \theta)}/\lambda. \label{eq:dual_sphere} \end{eqnarray}
    Proof: By the optimality of $\theta_{\lambda}^*$ for the ``marginal'' duality-gap function $\eta \mapsto \delta_{\lambda}(\beta, \eta)$, we have \begin{eqnarray} \delta_{\lambda}(\beta, \theta_{\lambda}^*) \le \delta_{\lambda}(\beta, \theta). \label{eq:dgap_ineq} \end{eqnarray} Now observe that $\theta_{\lambda}^*$, being the projection of the point $y/\lambda$ onto the dual-feasible polyhedron $\mathcal{P}$, lies on the boundary of $\mathcal{P}$, because the latter doesn't contain $y/\lambda$ since $\lambda < \|X^Ty\|_{\infty}$. Thus we have $\|X^T\theta_{\lambda}^*\|_{\infty} = 1$, and by Hölder's inequality it follows that \begin{eqnarray} \|\beta\|_1 - \beta^TX^T\theta_{\lambda}^* = \|\beta\|_1\|X^T\theta_{\lambda}^*\|_{\infty} - \beta^TX^T\theta_{\lambda}^* \ge 0. \label{eq:holder} \end{eqnarray} Finally, invoking formula \eqref{eq:dgap_formula} on the LHS of \eqref{eq:dgap_ineq} and using \eqref{eq:holder} completes the proof.
    Remark. Given a feasible primal-dual pair $(\beta, \theta) \in \mathbb{R}^p \times \mathcal{P}$, the above theorem prescribes a trust-region for the optimal dual point $\theta_{\lambda}^*$, namely the sphere $S_n\left((y - X\beta)/\lambda, \sqrt{2\delta_{\lambda}(\beta, \theta)}/\lambda\right)$.

    Static safe sphere test

    Let's begin with the following elementary but important lemma about the maximum value of the function $\theta \mapsto |b^T\theta|$ on the sphere \begin{eqnarray} S_n(c,r) := \{\theta \in \mathbb{R}^n | \|\theta - c\|_2 \le r\}, \end{eqnarray} of center $c \in \mathbb{R}^n$ and radius $r > 0$. Viz,
    \begin{eqnarray} \underset{\theta \in S_n(c, r)}{max}\text{ }|b^T\theta| = |b^Tc| + r\|b\|_2. \label{eq:max_sphere} \end{eqnarray}
    One has \begin{eqnarray*} \underset{\theta \in S_n(c, r)}{max}\text{ }b^T\theta = \underset{\theta \in S_n(0, r)}{max}\text{ }b^T(\theta + c) = b^Tc + r\underset{\theta \in S_n(0, 1)}{max}\text{ }b^T\theta = b^Tc + r\|b\|_2. \end{eqnarray*} Replacing $b$ with $-b$ in the above equation yields \begin{eqnarray*} \underset{\theta \in S_n(c, r)}{max}-b^T\theta = -b^Tc + r\|b\|_2. \end{eqnarray*} Now, combining both equations and using the fact that \begin{eqnarray*} |b^T\theta| \equiv max(b^T\theta, -b^T\theta), \end{eqnarray*} we obtain the desired result.
    The following lemma is from [Xiang et al. 2014].
    Given a safe sphere $S_n(c, r)$ for the optimal dual point $\theta_{\lambda}^*$, the following screening rule is safe \begin{eqnarray} \text{Discard the } j\text{th feature if } |X_j^Tc| + r\|X_j^T\|_2 < 1. \label{eq:sphere_test} \end{eqnarray}
    Direct application of \eqref{eq:fundamental} and \eqref{eq:max_sphere}.
    Using the trust-region \eqref{eq:dual_sphere} established in the previous theorem for the optimal dual variable $\theta_{\lambda}^*$, namely the sphere $S_n\left((y-X\beta)/\lambda, \sqrt{2\delta_{\lambda}(\beta,\theta)}/\lambda\right)$, we can devise a screening rule of the form \eqref{eq:sphere_test}. Indeed,
    For any primal-dual pair $(\beta,\theta) \in \mathbb{R}^p \times \mathbb{R}^n$, the rule \begin{eqnarray} \text{Discard the } j\text{th feature if } \left|X_j^T(y-X\beta)\right| + \sqrt{2\delta_{\lambda}(\beta,\theta)}\|X_j^T\|_2 < \lambda \label{eq:sphere_test_actual} \end{eqnarray} is safe.
    Use the last lemma, on the safe sphere $S_n\left((y-X\beta)/\lambda, \sqrt{2\delta_{\lambda}(\beta,\theta)}/\lambda\right)$ obtained in the previous theorem.
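Here is a minimal NumPy sketch of the resulting static sphere test (my own illustrative implementation of the rule \eqref{eq:sphere_test_actual}; it assumes $\theta$ is dual-feasible so that the gap formula \eqref{eq:dgap_formula} applies):

```python
import numpy as np

def duality_gap(X, y, beta, theta, lam):
    """Duality gap delta_lambda(beta, theta) via the closed-form formula above;
    theta must be dual-feasible."""
    resid = y - X @ beta
    return 0.5 * lam ** 2 * np.sum((theta - resid / lam) ** 2) \
           + lam * (np.abs(beta).sum() - theta @ X @ beta)

def static_sphere_test(X, y, beta, theta, lam):
    """Boolean mask of the features discarded by the sphere test."""
    resid = y - X @ beta
    gap = duality_gap(X, y, beta, theta, lam)
    scores = np.abs(X.T @ resid) + np.sqrt(2 * gap) * np.linalg.norm(X, axis=0)
    return scores < lam

# Toy usage: beta = 0 and the scaled-residual dual-feasible point y / max(lam, ||X^T y||_inf).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 200)), rng.normal(size=50)
lam = 0.5 * np.abs(X.T @ y).max()
beta = np.zeros(200)
theta = y / max(lam, np.abs(X.T @ y).max())
print(static_sphere_test(X, y, beta, theta, lam).sum(), "features screened out")
```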

    Dynamic safe sphere test

    The results presented here are due to very recent work by Gramfort and co-workers. The screening rule in \eqref{eq:sphere_test_actual} makes sense (i.e. there is any hope it could ever screen some features) only if $\delta_{\lambda}(\beta,\theta) < +\infty$, i.e. only if $\theta$ is dual-feasible. In fact, the smaller the duality-gap $\delta_{\lambda}(\beta,\theta)$, the more effective the screening rule is. Thus we need a procedure which, given a primal point $\beta \in \mathbb{R}^p$, generates a dual-feasible point $\theta$ for which $\delta_{\lambda}(\beta,\theta)$ is as small as possible. As was first mentioned in [Bonnefoy et al. 2014], any iterative solver for \eqref{eq:primal} can be used to produce a sequence of primal-dual feasible pairs $(\beta^{(k)}, \theta^{(k)}) \in \mathbb{R}^p \times \mathcal{P}$, and a decreasing sequence of safe spheres $S_n\left(y / \lambda, \|\theta^{(k)} - y / \lambda\|_2\right)$. Indeed, for each primal iterate $\beta^{(k)}$, one finds a scalar $\mu^{(k)} \in \mathbb{R}$ such that $\mu^{(k)} (y - X\beta^{(k)}) \in \mathcal{P}$ is the dual-feasible point of this form closest to $y / \lambda$. This sub-problem is a simple minimization of the quadratic function $\mu \mapsto \|\mu(y-X\beta^{(k)}) - y / \lambda\|^2_2$ on the closed real interval $\left[-\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}, \frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}\right]$, with the analytic solution \begin{eqnarray} \mu^{(k)} = \begin{cases} \left[\frac{y^T(y-X\beta^{(k)})}{\lambda \|y-X\beta^{(k)}\|_2^2}\right]^{-\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}}_{\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}}, &\mbox{if } X\beta^{(k)} \ne y\\ 1, &\mbox{otherwise}. \end{cases} \end{eqnarray} The resulting algorithm ("Poorman's FISTA with dynamic screening") is depicted below.
    • Input: $\lambda \in \text{ }]0, \|X^Ty\|_{\infty}[$, the regularization parameter; $\epsilon > 0$, the desired precision on the duality gap.
    • Initialize: $\beta^{(0)} \leftarrow 0 \in \mathbb{R}^p$, $\theta^{(0)} \leftarrow y/\lambda \in \mathcal{P}$, $\delta^{(0)} \leftarrow+\infty$, $t^{(0)} \leftarrow 1$, and $k \leftarrow 0$.
    • Repeat (until $\delta^{(k)} < \epsilon$):
      • $\beta^{(k + 1)} \leftarrow soft_{\lambda/L}(\eta^{(k)} - \frac{1}{L}X^T(X\eta^{(k)} - y)), \hspace{.5em}\theta^{(k+1)} \leftarrow \mu^{(k+1)}(y - X\beta^{(k+1)})$, where $L := \|X\|_2^2$
      • $t^{(k+1)} \leftarrow \frac{1 + \sqrt{4(t^{(k)})^2 + 1}}{2}, \hspace{.5em}\eta^{(k+1)} \leftarrow \beta^{(k+1)} + \frac{t^{(k)} - 1}{t^{(k+1)}}(\beta^{(k+1)} - \beta^{(k)})$
      • $\delta^{(k+1)} \leftarrow \frac{\lambda^2}{2}\left\|\theta^{(k+1)} - (y-X\beta^{(k+1)})/\lambda\right\|^2_2 + \lambda\left(\|\beta^{(k+1)}\|_1 - (\theta^{(k+1)})^TX\beta^{(k + 1)}\right)$
      • $X, \beta^{(k+1)}, \eta^{(k+1)} \leftarrow screen(X,y,\beta^{(k+1)}, \eta^{(k+1)}, \delta^{(k+1)})$
      • $k \leftarrow k + 1$
    • Return $\beta^{(k)}$
    • One can show that the above dynamic screening procedure does no harm to the convergence theory of the iterative solver.
    • A coordinate-descent solver would be more appropriate; we presented FISTA here for simplicity rather than applicability. (A toy NumPy sketch of the full procedure is given right below.)
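For completeness, here is a compact NumPy sketch of the procedure above (my own simplified rendering, not production code; the screening step simply masks out columns via the sphere test of the previous section instead of physically shrinking $X$):

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_fista_dynamic_screening(X, y, lam, eps=1e-6, max_iter=10_000):
    """Toy FISTA for the Lasso with dynamic (gap-based) sphere screening."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the smooth part
    col_norms = np.linalg.norm(X, axis=0)
    beta, eta, t = np.zeros(p), np.zeros(p), 1.0
    active = np.ones(p, dtype=bool)          # features not (yet) screened out
    for _ in range(max_iter):
        # FISTA step, restricted to the surviving features
        grad = X[:, active].T @ (X[:, active] @ eta[active] - y)
        beta_new = np.zeros(p)
        beta_new[active] = soft(eta[active] - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(4.0 * t ** 2 + 1.0)) / 2.0
        eta = beta_new + (t - 1.0) / t_new * (beta_new - beta)
        beta, t = beta_new, t_new
        # dual-feasible point by scaling the residual, then the duality gap
        resid = y - X @ beta
        bound = 1.0 / max(np.abs(X.T @ resid).max(), 1e-12)
        mu = np.clip(y @ resid / (lam * max(resid @ resid, 1e-12)), -bound, bound)
        theta = mu * resid
        gap = 0.5 * lam ** 2 * np.sum((theta - resid / lam) ** 2) \
              + lam * (np.abs(beta).sum() - theta @ X @ beta)
        if gap < eps:
            break
        # dynamic screening: drop features failing the sphere test
        active &= np.abs(X.T @ resid) + np.sqrt(2.0 * gap) * col_norms >= lam
        beta[~active] = 0.0
        eta[~active] = 0.0
    return beta

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 300)), rng.normal(size=50)
beta = lasso_fista_dynamic_screening(X, y, lam=0.5 * np.abs(X.T @ y).max())
```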

    Univariate (heuristic) screening for brain decoding problems

    In [Dohmatob et al. PRNI2015], we proposed (amongst other tricks) a univariate heuristic for detecting and disregarding irrelevant voxels in brain decoding problems (see figure above). This heuristic can result in up to a 10-fold speedup over full-brain analysis. Get a quick overview here.

    Details of Nesterov smoothing scheme for Nash equilibria in matrix games

    Motivated by a recent issue on stackexchange, the goal of this post is to clarify some of the technical details of Nesterov's smoothing scheme, in the particular case of computing approximate Nash equilibria in matrix games. This manuscript is not novel, but simply strives to help familiarize the reader with Nesterov's smoothing scheme for minimizing non-smooth functions.

    Introduction


    Let $A \in \mathbb{R}^{m\times n}$, $c \in \mathbb{R}^n$ and $b \in\mathbb{R}^m$, and consider a matrix game with payoff function $(x, u) \mapsto \langle Ax, u\rangle + \langle c, x\rangle + \langle b, u\rangle$. The Nash equilibrium problem is \begin{eqnarray} \underset{x \in \Delta_n}{\text{minimize }}\underset{u \in \Delta_m}{\text{maximize }}\langle Ax, u\rangle + \langle c, x\rangle + \langle b, u\rangle. \end{eqnarray} From the "min" player's point of view, this problem can be re-written as \begin{eqnarray} \underset{x \in \Delta_n}{\text{minimize }}f(x), \end{eqnarray} where the proper convex function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is defined by \begin{eqnarray} f(x) := \hat{f}(x) + \underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u),\hspace{.5em}\hat{f}(x) := \langle c, x\rangle,\hspace{.5em}\hat{\phi}(u) := \langle b, u\rangle. \end{eqnarray} Of course, the dual problem is \begin{eqnarray} \underset{u \in \Delta_m}{\text{maximize }}\phi(u), \end{eqnarray} where the proper concave function $\phi: \mathbb{R}^m \rightarrow \mathbb{R}$ is defined by \begin{eqnarray} \phi(u) := -\hat{\phi}(u) + \underset{x \in \Delta_n}{\text{min }}\langle Ax, u\rangle + \hat{f}(x). \end{eqnarray} Now, for a positive scalar $\mu$, consider a smoothed version of $f$ \begin{eqnarray} \bar{f}_\mu(x) := \hat{f}(x) + \underbrace{\underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u) - \mu d_2(u)}_{f_\mu(x)}, \end{eqnarray}
    where $d_2$ is a prox-function (see Nesterov's paper for the definition) for the simplex $\Delta_m$, with prox-center $\bar{u} \in \Delta_m$. For simplicity, take $d_2(u) := \frac{1}{2}\|u - \bar{u}\|^2 \ge 0 = d_2(\bar{u}), \forall u \in \Delta_m$ and $\bar{u} := (1/m,1/m,...,1/m) \in \Delta_m$. Define a prox-function $d_1$ for $\Delta_n$ analogously. Note that $\sigma_1 = \sigma_2 = 1$ since the $d_j$'s are $1$-strongly convex. Also, noting that the $d_j$'s attain their maximum at the kinks of the respective simplexes, one computes $D_1 := \underset{x \in \Delta_n}{\text{max }}d_1(x) = \frac{1}{2} \times$ the squared distance between any kink and $\bar{x}$, i.e. $D_1 = \frac{1}{2}\left((1-1/n)^2 + (n-1)(1/n)^2\right) = (1 - 1 / n) / 2$. Similarly, $D_2 := \underset{u \in \Delta_m}{\text{max }}d_2(u) = (1 - 1 / m) / 2$. Note that the functions $(f_\mu)_{\mu > 0}$ increase point-wise to $f$ in the limit $\mu \rightarrow 0^+$.



    Ingredients for implementing the algorithm

    Let us now prepare the ingredients necessary for implementing the algorithm. Definition. Given a closed convex subset $C \subseteq \mathbb{R}^m$, and a point $x \in \mathbb{R}^m$, recall the definition of the orthogonal projection of $x$ unto $C$ \begin{eqnarray} proj_C(x) := \underset{z \in C}{\text{argmin }}\frac{1}{2}\|z-x\|^2. \end{eqnarray} Geometrically, $proj_C(x)$ is the unique point of $C$ which is closest to $x$, and so $proj_C$ is a well-defined function from $\mathbb{R}^m$ to $C$. For example, if $C$ is the nonnegative $m$th-orthant $\mathbb{R}_+^m$, then $proj_C(x) \equiv (x)_+$, the component-wise maximum of $x$ and $0$.
    Step 1: Call to the oracle, or computation of $f(x_k)$ and $\nabla f(x_k)$. For any $x \in \mathbb{R}^n$, define $v_\mu(x) := (Ax - b) / \mu$ and $u_\mu(x) := proj_{\Delta_m}(\bar{u} + v_\mu(x))$. One computes \begin{eqnarray*} \begin{split} f_\mu(x) := \underset{u \in \Delta_m}{\text{max }}\langle Ax, u\rangle - \hat{\phi}(u) - \mu d_2(u) &= -\underset{u \in \Delta_m}{\text{min }}\frac{1}{2}\mu\|u-\bar{u}\|^2 + \langle b - Ax, u\rangle\\ &= \frac{1}{2}\mu\|v_\mu(x)\|^2 - \underset{u \in \Delta_m}{\text{min }}\frac{1}{2}\mu\|u - (\bar{u} + v_\mu(x))\|^2, \end{split} \end{eqnarray*} by completing the square in $u$ and ignoring constant terms. One recognizes the minimization problem on the right-hand side as the orthogonal projection of the point $\bar{u} + v_\mu(x)$ unto the simplex $\Delta_m$. It is not difficult to show that $f_\mu$ is smooth with gradient $\nabla f_\mu: x \mapsto A^Tu_\mu(x)$ (one can invoke Danskin's theorem, for example) and $L_\mu: = \frac{1}{\mu\sigma_2}\|A\|^2$ is a Lipschitz constant for the latter. Thus, $\bar{f}_\mu$ is smooth and one computes \begin{eqnarray} \nabla \bar{f}_\mu(x_k) = \nabla \hat{f}(x_k) + \nabla f_\mu(x_k) = c + A^*u_\mu(x_k). \end{eqnarray} Moreover, $L_\mu$ defined above is also a Lipschitz constant for $\nabla \bar{f}_\mu$.
    Step 3: Computation of $z_k$. For any $s \in \mathbb{R}^n$, one has \begin{eqnarray*} \underset{x \in \Delta_n}{\text{argmin }}Ld_1(x) + \langle s, x\rangle = \underset{x \in \Delta_n}{\text{argmin }}\frac{1}{2}L_\mu\|x-\bar{x}\|^2 + \langle s, x\rangle = proj_{\Delta_n}\left(\bar{x} - \frac{1}{L_\mu}s\right), \end{eqnarray*} by completing the square in $x$ and ignoring constant terms. Thus, with $s = \sum_{i=0}^k\frac{i+1}{2}\nabla \bar{f}_\mu(x_i)$, we obtain the update rule \begin{eqnarray} z_k = proj_{\Delta_n}\left(\bar{x} - \frac{1}{L_\mu}\sum_{i=0}^k\frac{i+1}{2}\nabla \bar{f}_\mu(x_i)\right). \end{eqnarray}
    Step 2: Computation of $y_k$. Similarly to the previous computation, one obtains the update rule \begin{eqnarray} y_k = T_{\Delta_n}(x_k) := \underset{y \in \Delta_n}{\text{argmin }}\langle \nabla \bar{f}_\mu(x), y\rangle + \frac{1}{2}L_\mu\|y-x\|^2 = proj_{\Delta_n}\left(x_k - \frac{1}{L_\mu}\nabla \bar{f}_\mu (x_k)\right). \end{eqnarray} Stopping criterion. The algorithm is stopped once the primal-dual gap \begin{eqnarray} \begin{split} &f(x_k) - \phi(u_\mu(x_k))=\\ &\left(\langle c, x_k \rangle + \underset{u \in \Delta_m}{\text{max }}\langle Ax_k, u\rangle - \langle b, u\rangle\right) - \left(-\langle b, u_\mu(x_k)\rangle + \underset{z \in \Delta_n}{\text{min }}\langle Az, u_\mu(x_k)\rangle + \langle c, z\rangle\right)\\ &= \langle c, x_k \rangle + \underset{1 \le j \le m}{\text{max }}(Ax_k-b)_j + \langle b, u_\mu(x_k)\rangle - \underset{1 \le i \le n}{\text{min }}(A^Tu_\mu(x_k) + c)_i \end{split} \end{eqnarray} is below a tolerance $\epsilon > 0$ (say $\epsilon \sim 10^{-4}$). Note that in the above computation, we have used the fact that for any $r \in \mathbb{R}^m$, one has \begin{eqnarray} \begin{split} \underset{u \in \Delta_m}{\text{max }}\langle r, u\rangle &= \text{max of }\langle r, .\rangle\text{ on the boundary of }\Delta_m \\ &= \text{max of }\langle r, .\rangle\text{ on the line segment joining any two distinct kinks of } \Delta_m \\ &= \underset{1 \le i < j \le m}{\text{max }}\text{ }\underset{0 \le t \le 1}{\text{max }}tr_i + (1-t)r_j = \underset{1 \le i \le m}{\text{max }}r_i. \end{split} \end{eqnarray}
    Algorithm parameters. In accord with Nesterov's recommendation, we may take \begin{eqnarray} \mu = \frac{\epsilon}{2D_2} = \frac{\epsilon}{1 - 1/m}, \text{ and }L_\mu=\frac{\|A\|^2}{\sigma_2 \mu} = \frac{\|A\|^2}{\mu}. \end{eqnarray} A toy implementation of the whole scheme is sketched below.
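Putting the pieces together, here is a self-contained NumPy sketch of the resulting scheme (my own simplified implementation: the final update $x_{k+1} = \frac{2}{k+3}z_k + \frac{k+1}{k+3}y_k$ is the one from Nesterov's paper, the simplex projection is the usual sort-based routine, and the stopping criterion is the exact primal-dual gap discussed above):

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto the probability simplex (standard sort-based routine)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def smoothed_matrix_game(A, b, c, eps=1e-3, max_iter=100_000):
    """Nesterov smoothing for min_{x in Delta_n} max_{u in Delta_m} <Ax,u> + <c,x> + <b,u>."""
    m, n = A.shape
    x_bar, u_bar = np.ones(n) / n, np.ones(m) / m
    D2 = (1.0 - 1.0 / m) / 2.0
    mu = eps / (2.0 * D2)                          # smoothing parameter
    L = np.linalg.norm(A, 2) ** 2 / mu             # Lipschitz constant of grad f_mu (sigma_2 = 1)
    x, s = x_bar.copy(), np.zeros(n)
    for k in range(max_iter):
        u = proj_simplex(u_bar + (A @ x - b) / mu)   # u_mu(x_k): the oracle call (Step 1)
        g = c + A.T @ u                              # gradient of the smoothed objective
        gap = c @ x + np.max(A @ x - b) + b @ u - np.min(A.T @ u + c)  # exact primal-dual gap
        if gap < eps:
            break
        y = proj_simplex(x - g / L)                  # Step 2: y_k
        s += 0.5 * (k + 1) * g
        z = proj_simplex(x_bar - s / L)              # Step 3: z_k
        x = 2.0 / (k + 3) * z + (k + 1.0) / (k + 3) * y
    return x, u

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))                          # a random 4-by-3 payoff matrix
x, u = smoothed_matrix_game(A, b=np.zeros(4), c=np.zeros(3))
print(x, u)                                          # an approximate Nash equilibrium pair
```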

    Limitations and extensions

    There are a number of conceptual problems with this smoothing scheme. For example, how do we select the smoothing parameter $\mu$? If it's too small, then the scheme becomes ill-conditioned (Lipschitz constants explode, etc.). If it's too large, then the approximation is very bad. A possible workaround is to start with a reasonably big value of $\mu$ and decrease it gradually across iterations (as was done in Algorithm 1 of this paper, for example). In fact, Nesterov himself has proposed the "Excessive Gap Technique" (google it) as a more principled way to go about smoothing. EGT has been used by Andrew Gilpin as an alternative way of solving incomplete-information two-person zero-sum sequential games (in such games, the players' strategy profiles are no longer simplexes but more general polyhedra), like Texas Hold'em.