In machine learning,
feature screening aims at detecting and eliminating irrelevant
(non-predictive) features, thus reducing the size of the underlying
optimization problem (here, problem
\eqref{eq:primal}). The general idea
is to compute, for each value of the regularization parameter,
a relevance measure for each feature, which is then compared
with a threshold (produced by the screening procedure itself).
Features which fall short of this threshold are declared
irrelevant and eliminated.
This post presents an overview of so-called dynamic screening
rules, a new generation of safe screening rules for the Lasso (and
related models like the Group-Lasso) which have appeared recently in
the literature (for example,
[Bonnefoy et al. 2014]). A particular emphasis is put on
the novel duality-gap-based screening rules due to
Gramfort and
co-authors.
We also present recent work on heuristic univariate screening
[Dohmatob et al. PRNI2015].
Notation
For a vector
v \in \mathbb{R}^p, we recall the definition of its
- l_1-norm \|v\|_1 := \sum_{1 \le j \le p}|v_j|,
- l_2-norm \|v\|_2 := (\sum_{1 \le j \le p}|v_j|^2)^{1/2}, and
- l_{\infty}-norm \|v\|_{\infty} := \max_{1 \le j \le p}|v_j|.
The transpose of a matrix
A \in \mathbb{R}^{n \times p} is denoted
A^T. The
ith row of
A is denoted
A_i. The
jth column of
A
is the
jth row of
A^T, namely
A^T_j. Finally, define the bracket
\begin{equation}
\left[a\right]_c^b := \min(\max(a, b), c),
\end{equation}
which clamps a to the interval [b, c] (for b \le c).
General considerations
Consider a Lasso model with response vector
y \in \mathbb{R}^n and
design matrix
X \in \mathbb{R}^{n \times p}. Let
\lambda
(with
0 < \lambda < \|X^Ty\|_{\infty}) be the regularization
parameter. The primal objective to be minimized as a function of the
primal variable
\beta \in \mathbb{R}^p (
\beta is the vector of
regressor coefficients) is
\begin{eqnarray}
p_{\lambda}(\beta) := \frac{1}{2}\|X\beta - y\|^2_2 +
\lambda\|\beta\|_1
\label{eq:primal}
\end{eqnarray}
In the sense of Fenchel-Rockafellar, the dual objective to be
maximized as a function of the dual variable
\theta \in \mathbb{R}^n
is
\begin{eqnarray}
d_{\lambda}(\theta) := \begin{cases}
\frac{1}{2}\|y\|^2_2 - \frac{\lambda^2}{2}\|\theta -
\frac{y}{\lambda}\|^2_2, &\mbox{if } \|X^T\theta\|_{\infty} \le 1\\
-\infty, &\mbox{otherwise}.
\end{cases}
\label{eq:dual}
\end{eqnarray}
Finally, the duality-gap
\delta_{\lambda}(\beta, \theta) at
(\beta,
\theta) is defined by
\begin{eqnarray}
\delta_{\lambda}(\beta, \theta) := p_{\lambda}(\beta) -
d_{\lambda}(\theta).
\label{eq:dgap}
\end{eqnarray}
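For concreteness, here is a minimal numpy sketch of these three quantities (the function names are my own, not taken from any particular library):

```python
import numpy as np

def primal_objective(X, y, beta, lam):
    """Primal Lasso objective p_lambda(beta), cf. eq. (primal)."""
    return 0.5 * np.sum((X.dot(beta) - y) ** 2) + lam * np.sum(np.abs(beta))

def dual_objective(X, y, theta, lam):
    """Dual objective d_lambda(theta), cf. eq. (dual); -inf outside P."""
    if np.max(np.abs(X.T.dot(theta))) > 1.0 + 1e-12:  # dual-feasibility check
        return -np.inf
    return 0.5 * np.sum(y ** 2) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)

def duality_gap(X, y, beta, theta, lam):
    """Duality gap delta_lambda(beta, theta) = p_lambda(beta) - d_lambda(theta)."""
    return primal_objective(X, y, beta, lam) - dual_objective(X, y, theta, lam)
```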
One notes the following straightforward facts:
- \delta_{\lambda}(\beta, \theta) \ge 0 with equality iff
\beta minimizes \eqref{eq:primal} and \theta maximizes
\eqref{eq:dual}. Such a primal-dual pair is called an
optimal pair.
- The dual objective d_{\lambda} defined in \eqref{eq:dual} has
a unique maximizer \theta_{\lambda}^*, which corresponds to the
Euclidean projection of y/\lambda onto the dual-feasible
polyhedron
\begin{eqnarray}
\mathcal{P} := \{\theta \in \mathbb{R}^n | \|X^T\theta\|_{\infty} \le
1\}.
\end{eqnarray}
Note that \mathcal{P} is closed and convex, so this projection is well-defined.
Given an optimal primal-dual pair (\beta^*_{\lambda},
\theta^*_{\lambda}) \in \mathbb{R}^p \times \mathcal{P}, we have
the fundamental safe screening rule (safe meaning that it provably
cannot mistakenly discard active features):
\begin{eqnarray}
|X^T_j\theta^*_{\lambda}| < 1 \implies \beta_{\lambda,j}^* =
0\text{ (i.e. the $j$th feature is irrelevant)}.
\label{eq:fundamental}
\end{eqnarray}
The inequality
\eqref{eq:fundamental} suggests constructing safe screening rules as follows:
- Construct a ``small'' compact set C \subseteq \mathbb{R}^n
containing the dual optimal point \theta_{\lambda}^*
(so in particular C \cap \mathcal{P} \ne \emptyset), such that the
maximum value m_{C,j} := \underset{\theta \in C}{max}\text{
}|X^T_j\theta| can be easily computed.
- Noting that m_{C,j} is an upper bound for
|X^T_j\theta^*_{\lambda}| in \eqref{eq:fundamental}, we may then
discard all features j for which m_{C,j} < 1.
The rest of this manuscript overviews methods for effectively
realizing such a construction.
Safe sphere tests
We start with the following lemma which provides a useful formula
for the duality-gap
\delta_{\lambda}(\beta, \theta) defined in
\eqref{eq:dgap}.
For every (\beta, \theta) \in \mathbb{R}^p \times \mathbb{R}^n, we
have
\begin{eqnarray}
\delta_{\lambda}(\beta, \theta) = \begin{cases}
\frac{\lambda^2}{2}\left\|\theta -
(y-X\beta)/\lambda\right\|^2_2 + \lambda\left(\|\beta\|_1 -
\theta^TX\beta\right), &\mbox{if } \theta \in \mathcal{P}\\
+\infty, &\mbox{otherwise}.
\end{cases}
\label{eq:dgap_formula}
\end{eqnarray}
Expand the formula in \eqref{eq:dgap}, and then complete the square
w.r.t. \theta.
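As a quick numerical sanity check (on random data, using numpy; this is my own check, not part of the cited references), one can verify that the closed-form expression \eqref{eq:dgap_formula} agrees with the direct computation p_{\lambda}(\beta) - d_{\lambda}(\theta) for a dual-feasible \theta:

```python
import numpy as np

rng = np.random.RandomState(0)
n, p = 5, 8
X, y, beta = rng.randn(n, p), rng.randn(n), rng.randn(p)
lam = 0.5 * np.max(np.abs(X.T.dot(y)))

# a dual-feasible point: rescale an arbitrary vector into P
theta = rng.randn(n)
theta /= max(1.0, np.max(np.abs(X.T.dot(theta))))

# duality gap computed directly as p_lambda(beta) - d_lambda(theta)
primal = 0.5 * np.sum((X.dot(beta) - y) ** 2) + lam * np.sum(np.abs(beta))
dual = 0.5 * np.sum(y ** 2) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
gap_direct = primal - dual

# duality gap via the closed-form expression (eq. dgap_formula)
residual = y - X.dot(beta)
gap_formula = (0.5 * lam ** 2 * np.sum((theta - residual / lam) ** 2)
               + lam * (np.sum(np.abs(beta)) - theta.dot(X.dot(beta))))

assert np.isclose(gap_direct, gap_formula)
```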
Let (\beta, \theta) \in \mathbb{R}^p \times \mathcal{P}
be a feasible primal-dual pair. If \theta_{\lambda}^* is the optimal
dual point (i.e. the maximizer of the dual objective d_{\lambda} defined in
\eqref{eq:dual}), then it holds that
\begin{eqnarray}
\left\|\theta_{\lambda}^* -
(y-X\beta)/\lambda\right\|_2 \le
\sqrt{2\delta_{\lambda}(\beta, \theta)}/\lambda.
\label{eq:dual_sphere}
\end{eqnarray}
By the optimality of \theta_{\lambda}^* for the ``marginal''
duality-gap function \eta \mapsto \delta_{\lambda}(\beta, \eta), we
have
\begin{eqnarray}
\delta_{\lambda}(\beta, \theta_{\lambda}^*) \le
\delta_{\lambda}(\beta, \theta).
\label{eq:dgap_ineq}
\end{eqnarray}
Now observe that \theta_{\lambda}^*, being the projection of the point
y/\lambda onto the dual-feasible polyhedron \mathcal{P}, lies on
the boundary of \mathcal{P} because the latter doesn't contain
y/\lambda since \lambda < \|X^Ty\|_{\infty}. Thus we have
\|X^T\theta_{\lambda}^*\|_{\infty} = 1, and by Hölder's inequality
it follows that
\begin{eqnarray}
\|\beta\|_1 - \beta^TX^T\theta_{\lambda}^* =
\|\beta\|_1\|X^T\theta_{\lambda}^*\|_{\infty} -
\beta^TX^T\theta_{\lambda}^* \ge 0.
\label{eq:holder}
\end{eqnarray}
Finally, invoking formula \eqref{eq:dgap_formula} on the LHS of
\eqref{eq:dgap_ineq} and using \eqref{eq:holder} completes the proof.
Given a feasible primal-dual pair
(\beta, \theta) \in \mathbb{R}^p \times \mathcal{P}, the above theorem prescribes a trust-region for the optimal dual point
\theta_{\lambda}^*, namely the sphere
S_n\left((y - X\beta)/\lambda,
\sqrt{2\delta_{\lambda}(\beta, \theta)}/\lambda\right).
Static safe sphere test
Let's begin with the following elementary but important lemma about
the maximum value of the function
\theta \mapsto |b^T\theta| on
the sphere
\begin{eqnarray}
S_n(c,r) := \{\theta \in \mathbb{R}^n | \|\theta - c\|_2 \le r\},
\end{eqnarray}
of center
c \in \mathbb{R}^n and radius
r > 0. Viz,
\begin{eqnarray}
\underset{\theta \in S_n(c, r)}{max}\text{ }|b^T\theta| = |b^Tc| +
r\|b\|_2.
\label{eq:max_sphere}
\end{eqnarray}
One has
\begin{eqnarray*}
\underset{\theta \in S_n(c, r)}{max}\text{ }b^T\theta =
\underset{\theta \in S_n(0, r)}{max}\text{ }b^T(\theta + c) = b^Tc +
r\underset{\theta \in S_n(0, 1)}{max}\text{ }b^T\theta = b^Tc +
r\|b\|_2.
\end{eqnarray*}
Replacing b with -b in the above equation yields
\begin{eqnarray*}
\underset{\theta \in S_n(c, r)}{max}-b^T\theta = -b^Tc +
r\|b\|_2.
\end{eqnarray*}
Now, combining both equations and using the fact that
\begin{eqnarray*}
|b^T\theta| \equiv max(b^T\theta, -b^T\theta),
\end{eqnarray*}
we obtain the desired result.
The following lemma is from
[Xiang et al. 2014].
Given a safe sphere S_n(c, r) for the optimal dual point
\theta_{\lambda}^*, the following screening rule is safe
\begin{eqnarray}
\text{Discard the } j\text{th feature if }
|X_j^Tc| + r\|X_j^T\|_2 < 1.
\label{eq:sphere_test}
\end{eqnarray}
Direct application of \eqref{eq:fundamental} and
\eqref{eq:max_sphere}.
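A minimal numpy sketch of this generic sphere test (the function name and signature are my own illustrative choices):

```python
import numpy as np

def sphere_test(X, center, radius):
    """Generic safe sphere test (eq. sphere_test): given a sphere S_n(center,
    radius) known to contain the optimal dual point, return a boolean mask of
    the features that can be discarded."""
    # m_j = max over the sphere of |X_j^T theta| = |X_j^T center| + radius * ||X_j||_2
    m = np.abs(X.T.dot(center)) + radius * np.linalg.norm(X, axis=0)
    return m < 1.0
```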
Using the trust-region
\eqref{eq:dual_sphere} established in the previous theorem for the optimal dual variable
\theta_{\lambda}^*, namely the sphere
S_n\left((y-X\beta)/\lambda,
\sqrt{2\delta_{\lambda}(\beta,\theta)}/\lambda\right), we can
devise a screening rule of the form
\eqref{eq:sphere_test}. Indeed,
For any primal-dual pair (\beta,\theta) \in \mathbb{R}^p \times
\mathbb{R}^n, the rule
\begin{eqnarray}
\text{Discard the } j\text{th feature if }
\left|X_j^T(y-X\beta)\right| +
\sqrt{2\delta_{\lambda}(\beta,\theta)}\|X_j^T\|_2 < \lambda
\label{eq:sphere_test_actual}
\end{eqnarray}
is safe.
Use the last lemma, on the safe sphere
S_n\left((y-X\beta)/\lambda,
\sqrt{2\delta_{\lambda}(\beta,\theta)}/\lambda\right) obtained in
the previous theorem.
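Here is a minimal numpy sketch of the rule \eqref{eq:sphere_test_actual}, computing the duality gap via \eqref{eq:dgap_formula}; the function name is my own, and a small guard against tiny negative gap values due to rounding is added:

```python
import numpy as np

def gap_safe_sphere_test(X, y, beta, theta, lam):
    """Screening rule (eq. sphere_test_actual): boolean mask of the features
    that can be safely discarded, for any primal point beta and any
    dual-feasible point theta."""
    residual = y - X.dot(beta)
    # duality gap via the closed-form expression (eq. dgap_formula)
    gap = (0.5 * lam ** 2 * np.sum((theta - residual / lam) ** 2)
           + lam * (np.sum(np.abs(beta)) - theta.dot(X.dot(beta))))
    radius_term = np.sqrt(2.0 * max(gap, 0.0))  # guard against rounding noise
    return np.abs(X.T.dot(residual)) + radius_term * np.linalg.norm(X, axis=0) < lam
```

In the dynamic setting below, this test is re-applied at every iteration with the current iterates.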
Dynamic safe sphere test
The results presented here are due to very recent work by Gramfort and
co-workers.
The screening rule in
\eqref{eq:sphere_test_actual} only makes sense
(i.e. there is any hope it could ever screen some features) if
\delta_{\lambda}(\beta,\theta) < +\infty, i.e. only if
\theta is
dual-feasible. In fact, the smaller the duality-gap
\delta_{\lambda}(\beta,\theta), the more effective the screening
rule is. Thus we need a procedure which, given a primal point
\beta
\in \mathbb{R}^p, generates a dual-feasible point
\theta for which
\delta_{\lambda}(\beta,\theta) is as small as possible. As was first
mentioned in
[Bonnefoy et al. 2014], any iterative solver for
\eqref{eq:primal} can be used to produce a sequence of primal-dual
feasible pairs
(\beta^{(k)}, \theta^{(k)}) \in \mathbb{R}^p
\times \mathcal{P}, and a decreasing sequence of safe spheres
S_n\left(y / \lambda, \|\theta^{(k)} - y / \lambda\|_2\right). Indeed,
for each primal iterate
\beta^{(k)}, one finds a scalar
\mu^{(k)} \in
\mathbb{R} such that
\mu^{(k)} (y - X\beta^{(k)}) \in \mathcal{P} is the closest point to
y / \lambda among dual-feasible points of this form. This sub-problem
amounts to minimizing the quadratic function
\mu \mapsto
\|\mu(y-X\beta^{(k)}) - y / \lambda\|^2_2 on a closed real interval
\left[-\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}, \frac{1}{\|X^T(y
- X\beta^{(k)})\|_{\infty}}\right], with an analytic solution
\begin{eqnarray}
\mu^{(k)} = \begin{cases}
\left[\frac{y^T(y-X\beta^{(k)})}{\lambda
\|y-X\beta^{(k)}\|_2^2}\right]^{-\frac{1}{\|X^T(y -
X\beta^{(k)})\|_{\infty}}}_{\frac{1}{\|X^T(y -
X\beta^{(k)})\|_{\infty}}}, &\mbox{if } X\beta^{(k)} \ne
y\\
1, &\mbox{otherwise}.
\end{cases}
\end{eqnarray}
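A minimal numpy sketch of this dual-scaling step (the helper name is my own):

```python
import numpy as np

def dual_scaling(X, y, beta, lam):
    """Map a primal iterate beta to the dual-feasible point mu * (y - X beta),
    with mu given by the closed-form expression above."""
    residual = y - X.dot(beta)
    scale = np.max(np.abs(X.T.dot(residual)))
    if scale == 0.0:      # any mu is feasible; mu = 1 matches the formula when X beta == y
        return residual
    bound = 1.0 / scale   # feasibility requires |mu| <= 1 / ||X^T (y - X beta)||_inf
    # unconstrained minimizer of ||mu * residual - y / lam||_2^2, then clamped
    mu = np.clip(y.dot(residual) / (lam * np.sum(residual ** 2)), -bound, bound)
    return mu * residual
```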
The resulting algorithm ("Poor man's FISTA with dynamic screening") is depicted below, where soft_{\tau} denotes the entry-wise soft-thresholding operator soft_{\tau}(z)_j := \mathrm{sign}(z_j)\max(|z_j| - \tau, 0); a rough numpy sketch of the loop follows the listing.
- Input: \lambda \in \text{ }]0, \|X^Ty\|_{\infty}[ the regularization
parameter; \epsilon > 0 the desired precision on the duality
gap; L := \|X\|_2^2 the Lipschitz constant of the gradient \beta \mapsto X^T(X\beta - y).\\
- Initialize: \beta^{(0)} \leftarrow 0 \in \mathbb{R}^p,
\eta^{(0)} \leftarrow \beta^{(0)},
\theta^{(0)} \leftarrow y/\|X^Ty\|_{\infty} \in \mathcal{P}, \delta^{(0)}
\leftarrow+\infty, t^{(0)} \leftarrow 1, and k \leftarrow 0.
- Repeat (until \delta^{(k)} < \epsilon):
- \beta^{(k + 1)} \leftarrow soft_{\lambda/L}(\eta^{(k)} - \frac{1}{L}X^T(X\eta^{(k)} -
y)), \hspace{.5em}\theta^{(k+1)} \leftarrow \mu^{(k+1)}(y - X\beta^{(k+1)})
- t^{(k+1)} \leftarrow \frac{1 + \sqrt{4(t^{(k)})^2 +
1}}{2}, \hspace{.5em}\eta^{(k+1)} \leftarrow \beta^{(k+1)} +
\frac{t^{(k)} - 1}{t^{(k+1)}}(\beta^{(k+1)} - \beta^{(k)})
- \delta^{(k+1)} \leftarrow \frac{\lambda^2}{2}\left\|\theta^{(k+1)} -
(y-X\beta^{(k+1)})/\lambda\right\|^2_2 + \lambda\left(\|\beta^{(k+1)}\|_1 -
(\theta^{(k+1)})^TX\beta^{(k + 1)}\right)
- X, \beta^{(k+1)}, \eta^{(k+1)} \leftarrow
screen(X,y,\beta^{(k+1)}, \eta^{(k+1)}, \delta^{(k+1)})
- k \leftarrow k + 1
- Return \beta^{(k)}
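Below is a rough, self-contained numpy sketch of the loop above. It is a pedagogical sketch under assumptions of mine (the function name is invented, the Lipschitz constant is taken as L = \|X\|_2^2, and all steps are inlined), not optimized code:

```python
import numpy as np

def lasso_fista_dynamic_screening(X, y, lam, eps=1e-6, max_iter=1000):
    """FISTA for the Lasso with gap-safe dynamic screening (rough sketch)."""
    n_samples, n_features = X.shape
    kept = np.arange(n_features)          # indices of surviving features
    beta = np.zeros(n_features)
    eta = beta.copy()
    t = 1.0
    L = np.linalg.norm(X, ord=2) ** 2     # Lipschitz constant of the gradient
    for _ in range(max_iter):
        # proximal-gradient (soft-thresholding) step on the surviving features
        z = eta - X.T.dot(X.dot(eta) - y) / L
        beta_next = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        # dual scaling: map the primal iterate to a dual-feasible point
        residual = y - X.dot(beta_next)
        scale = np.max(np.abs(X.T.dot(residual)))
        mu = 1.0 if scale == 0.0 else np.clip(
            y.dot(residual) / (lam * np.sum(residual ** 2)), -1.0 / scale, 1.0 / scale)
        theta = mu * residual
        # duality gap via the closed-form expression (eq. dgap_formula)
        gap = (0.5 * lam ** 2 * np.sum((theta - residual / lam) ** 2)
               + lam * (np.sum(np.abs(beta_next)) - theta.dot(X.dot(beta_next))))
        # FISTA momentum update
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        eta = beta_next + (t - 1.0) / t_next * (beta_next - beta)
        beta, t = beta_next, t_next
        if gap < eps:
            break
        # dynamic screening: drop features flagged by the gap-safe sphere test;
        # the original L stays a valid (conservative) constant after dropping columns
        keep = (np.abs(X.T.dot(residual))
                + np.sqrt(2.0 * max(gap, 0.0)) * np.linalg.norm(X, axis=0)) >= lam
        X, beta, eta, kept = X[:, keep], beta[keep], eta[keep], kept[keep]
        if X.shape[1] == 0:               # everything screened: solution is 0
            break
    # scatter the surviving coefficients back into a full-length vector
    beta_full = np.zeros(n_features)
    beta_full[kept] = beta
    return beta_full
```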
Univariate (heuristic) screening for brain decoding problems
In
[Dohmatob et al. PRNI2015], we proposed (amongst other tricks) a univariate heuristic for detecting and disregarding irrelevant voxels in brain decoding problems (see figure above). This heuristic can result in up to a 10-fold speedup over full-brain analysis. Get a quick overview
here.
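The exact heuristic is described in the paper; purely as an illustration of the general idea of univariate screening (and not the specific rule from [Dohmatob et al. PRNI2015]), one could rank voxels by a simple univariate score such as the absolute correlation with the target and keep only the top-scoring fraction:

```python
import numpy as np

def univariate_screening(X, y, keep_fraction=0.1):
    """Illustrative univariate screening: score each feature (voxel) by the
    absolute correlation of its column with the target, keep the top fraction.
    Generic heuristic for illustration only."""
    Xc = X - X.mean(axis=0)   # center columns and target so the score is a correlation
    yc = y - y.mean()
    scores = np.abs(Xc.T.dot(yc)) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    n_keep = max(1, int(keep_fraction * X.shape[1]))
    return np.sort(np.argsort(scores)[-n_keep:])  # indices of the retained voxels
```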