Feature-screening à la [Dohmatob et al. PRNI2015] |
We also present resent work on heuristc univariate screening [Dohmatob et al. PRNI2015].
Notation
For a vector $v \in \mathbb{R}^p$, we recall the definition of its- $l_1$-norm $\|v\|_1 := \sum_{1 \le j \le p}|v_j|$,
- $l_2$-norm $\|v\|_2 := (\sum_{1 \le j \le p}|v_j|^2)^{1/2}$, and
- $l_{\infty}$-norm $\|v\|_{\infty} := \max_{1 \le j \le p}|v_j|$.
General considerations
Consider a Lasso model with response vector $y \in \mathbb{R}^p$ and design matrix $X \in \mathbb{R}^{n \times p}$. Let $\lambda$ (with $0 < \lambda < \|X^Ty\|_{\infty}$) be the regularization parameter. The primal objective to be minimized as a function of the primal variable $\beta \in \mathbb{R}^p$ ($\beta$ is the vector of regressor coefficients) is \begin{eqnarray} p_{\lambda}(\beta) := \frac{1}{2}\|X\beta - y\|^2_2 + \lambda\|\beta\|_1 \label{eq:primal} \end{eqnarray} In the sense of Fenchel-Rockafellar, the dual objective to be maximized as a function of the dual variable $\theta \in \mathbb{R}^n$ is \begin{eqnarray} d_{\lambda}(\theta) := \begin{cases} \frac{1}{2}\|y\|^2_2 - \frac{\lambda^2}{2}\|\theta - \frac{y}{\lambda}\|^2_2, &\mbox{if } \|X^T\theta\|_{\infty} \le 1\\ -\infty, &\mbox{otherwise}. \end{cases} \label{eq:dual} \end{eqnarray} Finally, the duality-gap $\delta_{\lambda}(\beta, \theta)$ at $(\beta, \theta)$ is defined by \begin{eqnarray} \delta_{\lambda}(\beta, \theta) := p_{\lambda}(\beta) - d_{\lambda}(\theta). \label{eq:dgap} \end{eqnarray} One notes the following straightforward facts- $\delta_{\lambda}(\beta, \theta) \ge 0$ with equality iff $\beta$ minimizes \eqref{eq:primal} and $\theta$ maximizes \eqref{eq:dual}. Such a primal-dual pair is called an optimal pair.
- Construct a ``small'' compact set $C \subseteq \mathbb{R}^n$, containing the dual optimal point $\theta_{\lambda}^*$ with $C \cap \mathcal{P} \ne \emptyset$, such that the value maximum value $m_{C,j} := \underset{\theta \in C}{max}\text{ }|X^T_j\theta|$ can be easily computed.
- Noting that $m_{C,j}$ is an upper bound for $|X^T_j\theta^*_{\lambda}|$ in \eqref{eq:fundamental}, we may then discard all features $j$ for which $m_{C,j} < 1$.
Safe sphere tests
We start with the following lemma which provides a useful formula for the duality-gap $\delta_{\lambda}(\beta, \theta)$ defined in \eqref{eq:dgap}.
For every $(\beta, \theta) \in \mathbb{R}^p \times \mathbb{R}^n$, we
have
\begin{eqnarray}
\delta_{\lambda}(\beta, \theta) = \begin{cases}
\frac{\lambda^2}{2}\left\|\theta -
(y-X\beta)/\lambda\right\|^2_2 + \lambda\left(\|\beta\|_1 -
\theta^TX\beta\right), &\mbox{if } \theta \in \mathcal{P}\\
+\infty, &\mbox{otherwise}.
\end{cases}
\label{eq:dgap_formula}
\end{eqnarray}
Expand the formula in \eqref{eq:dgap}, and then complete the square
w.r.t $\theta$.
Let $(\beta, \theta) \in \mathbb{R}^p \times \mathcal{P}$
be a feasible primal-dual pair. If $\theta_{\lambda}^*$ is an optimal
dual point (i.e maximizes the dual objective $p_{\lambda}$ defined in
\eqref{eq:dual}), then for every other dual point $\theta \in
\mathbb{R}^n$ it holds that
\begin{eqnarray}
\left\|\theta_{\lambda}^* -
(y-X\beta)/\lambda\right\|_2 \le
\sqrt{2\delta_{\lambda}(\beta, \theta)}/\lambda.
\label{eq:dual_sphere}
\end{eqnarray}
By the optimality of $\theta_{\lambda}^*$ for the ``marginal''
duality-gap function $\eta \mapsto \delta_{\lambda}(\beta, \eta)$, we
have
\begin{eqnarray}
\delta_{\lambda}(\beta, \theta_{\lambda}^*) \le
\delta_{\lambda}(\beta, \theta).
\label{eq:dgap_ineq}
\end{eqnarray}
Now observe that $\theta_{\lambda}^*$, being the projection of the point
$y/\lambda$ onto the dual-feasible polyhedron $\mathcal{P}$, lies on
the boundary of $\mathcal{P}$ because the latter doesn't contain
$y/\lambda$ since $\lambda < \|X^Ty\|_{\infty}$. Thus we have
$\|X^T\theta_{\lambda}^*\|_{\infty} = 1$, and by Holder's inequality
it follows that
\begin{eqnarray}
\|\beta\|_1 - \beta^TX^T\theta_{\lambda}^* =
\|\beta\|_1\|X^T\theta_{\lambda}^*\|_{\infty} -
\beta^TX^T\theta_{\lambda}^* \ge 0.
\label{eq:holder}
\end{eqnarray}
Finally, invoking formula \eqref{eq:dgap_formula} on the LHS of
\eqref{eq:dgap_ineq} and using \eqref{eq:holder} completes the proof.
Given a feasible primal-dual pair
$(\beta, \theta) \in \mathbb{R}^p \times \mathcal{P}$, the above theorem prescribes a trust-region for the optimal dual point
$\theta_{\lambda}^*$, namely the sphere $S_n(y - X\beta)/\lambda,
\sqrt{2\delta_{\lambda}(\beta, \theta)}/\lambda)$.
Static safe sphere test
Let's begin with the following elementary but important lemma about the maximum value of the function $\theta \mapsto |b^T\theta|$ on the sphere \begin{eqnarray} S_n(c,r) := \{\theta \in \mathbb{R}^n | \|\theta - c\|_2 \le r\}, \end{eqnarray} of center $c \in \mathbb{R}^n$ and radius $r > 0$. Viz,
\begin{eqnarray}
\underset{\theta \in S_n(c, r)}{max}\text{ }|b^T\theta| = |b^Tc| +
r\|b\|_2.
\label{eq:max_sphere}
\end{eqnarray}
One has
\begin{eqnarray*}
\underset{\theta \in S_n(c, r)}{max}\text{ }b^T\theta =
\underset{\theta \in S_n(0, r)}{max}\text{ }b^T(\theta + c) = b^Tc +
r\underset{\theta \in S_n(0, 1)}{max}\text{ }b^T\theta = b^Tc +
r\|b\|_2.
\end{eqnarray*}
Replacing $b$ with $-b$ in the above equation yields
\begin{eqnarray*}
\underset{\theta \in S_n(c, r)}{max}-b^T\theta = -b^Tc +
r\|b\|_2.
\end{eqnarray*}
Now, combining both equations and using the fact that
\begin{eqnarray*}
|b^T\theta| \equiv max(b^T\theta, -b^T\theta),
\end{eqnarray*}
we obtain the desired result.
The following lemma is from [Xiang et al. 2014].
Given a safe sphere $S_n(c, r)$ for the optimal dual point
$\theta_{\lambda}^*$, the following screening rule is safe
\begin{eqnarray}
\text{Discard the } j\text{th feature if }
|X_j^Tc| + r\|X_j^T\|_2 < 1.
\label{eq:sphere_test}
\end{eqnarray}
Direct application of \eqref{eq:fundamental} and
\eqref{eq:max_sphere}.
Using the trust-region \eqref{eq:dual_sphere} established in the previous theorem for the optimal dual variable
$\theta_{\lambda}^*$, namely the sphere $S_n\left((y-X\beta)/\lambda,
\sqrt{2\delta_{\lambda}(\beta,\theta)}/\lambda\right)$, we can
envisage to device a screening rule in the form
\eqref{eq:sphere_test}. Indeed,
For any primal-dual pair $(\beta,\theta) \in \mathbb{R}^p \times
\mathbb{R}^n$, the rule
\begin{eqnarray}
\text{Discard the } j\text{th feature if }
\left|X_j^T(y-X\beta)\right| +
\sqrt{2\delta_{\lambda}(\beta,\theta)}\|X_j^T\|_2 < \lambda
\label{eq:sphere_test_actual}
\end{eqnarray}
is safe.
Use the last lemma, on the safe sphere
$S_n\left((y-X\beta)/\lambda,
\sqrt{2\delta_{\lambda}(\beta,\theta)}/\lambda\right)$ obtained in
the previous theorem.
Dynamic safe sphere test
The results presented here are due to very recent work by Gramfort and co-workers. The screening rule in \eqref{eq:sphere_test_actual} only makes sense (i.e there is any hope it could ever screen some features) only if $\delta_{\lambda}(\beta,\theta) < +\infty$, i.e only if $\theta$ is dual-feasible. In fact, the smaller the duality-gap $\delta_{\lambda}(\beta,\theta)$, the more effective the screening rule is. Thus we need a procedure which, given a primal point $\beta \in \mathbb{R}^p$, generates a dual-feasible point $\theta$ for which $\delta_{\lambda}(\beta,\theta)$ is as small as possible. As was first mentioned in [Bonnefoy et al. 2014], any iterative solver for \eqref{eq:primal} can be used to produce a sequence of primal-dual feasible pairs $(\beta^{(k)}, \theta^{(k)}) \in \mathbb{R}^p \times \mathcal{P}$, and a decreasing sequence of safe spheres $S_n\left(y / \lambda, \|\theta^{(k)} - y / \lambda\right\|_2)$. Indeed for each primal iterate $\beta^{(k)}$, one finds a scalar $\mu^{(k)} \in \mathbb{R}$ such that $\mu^{(k)} (y - X\beta^{(k)}) \in \mathcal{P}$ is the dual-feasible point closest to $y / \lambda$. This sub-problem is a simple minimization problem of quadratic function $\mu \mapsto \|\mu(y-X\beta^{(k)}) - y / \lambda\|^2_2$ on a closed real interval $\left[-\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}, \frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}\right]$, with an analytic solution \begin{eqnarray} \mu^{(k)} = \begin{cases} \left[\frac{y^T(y-X\beta^{(k)})}{\lambda \|y-X\beta^{(k)}\|_2^2}\right]^{-\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}}_{\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}}, &\mbox{if } X\beta^{(k)} \ne y\\ 1, &\mbox{otherwise}. \end{cases} \end{eqnarray} The resulting algorithm ("Poorman's FISTA with dynamic screening") is depicted below.- Input: $\lambda \in \text{ }]0, \|X^Ty\|_{\infty}[$ the regularization parameter; $\epsilon > 0$ the desired precision on duality gap.\\
- Initialize: $\beta^{(0)} \leftarrow 0 \in \mathbb{R}^p$, $\theta^{(0)} \leftarrow y/\lambda \in \mathcal{P}$, $\delta^{(0)} \leftarrow+\infty$, $t^{(0)} \leftarrow 1$, and $k \leftarrow 0$.
- Repeat (until $\delta^{(k)} < \epsilon$):
- $\beta^{(k + 1)} \leftarrow soft_{\lambda/L}(\eta^{(k)} - X^T(X\eta^{(k)} - y)), \hspace{.5em}\theta^{(k+1)} \leftarrow \mu^{(k+1)}(y - X\beta^{(k+1)})$
- $t^{(k+1)} \leftarrow \frac{1 + \sqrt{4t^{(k)} + 1}}{2}, \hspace{.5em}\eta^{(k+1)} \leftarrow \beta^{(k+1)} + \frac{t^{(k)} - 1}{t^{(k+1)}}(\beta^{(k+1)} - \beta^{(k)})$
- $\delta^{(k+1)} \leftarrow \frac{\lambda^2}{2}\left\|\theta^{(k+1)} - (y-X\beta^{(k+1)})/\lambda\right\|^2_2 + \lambda\left(\|\beta^{(k+1)}\|_1 - \theta^TX\beta^{(k + 1)}\right)$
- $X, \beta^{(k+1)}, \eta^{(k+1)} \leftarrow screen(X,y,\beta^{(k+1)}, \eta^{(k+1)}, \delta^{(k+1)})$
- $k \leftarrow k + 1$
- Return $\beta^{(k)}$
- One can show that the above dynamic screening procedure those no harm to the convergence theory the iterative solver.
- A coordinate-descent solver would be more appropriate. We presented FISTA here for simplicity rather than applicability
No comments:
Post a Comment