
Monday, June 29, 2015

The future of screening

Feature-screening à la [Dohmatob et al. PRNI2015]
In machine learning, feature screening aims at detecting and eliminating irrelevant (non-predictive) features, thus reducing the size of the underlying optimization problem (here, problem \eqref{eq:primal}). The general idea is to compute, for each value of the regularization parameter, a relevance measure for each feature, which is then compared with a threshold (produced by the screening procedure itself). Features which fall short of this threshold are declared irrelevant and eliminated. This post presents an overview of so-called dynamic screening rules, a new generation of safe screening rules for the Lasso (and related models like the Group Lasso) which have appeared recently in the literature (for example, [Bonnefoy et al. 2014]). Particular emphasis is put on the novel duality-gap-based screening rules due to Gramfort and co-authors.


We also present recent work on heuristic univariate screening [Dohmatob et al. PRNI2015].

Notation

For a vector v \in \mathbb{R}^p, we recall the definition of its
  • l_1-norm \|v\|_1 := \sum_{1 \le j \le p}|v_j|,
  • l_2-norm \|v\|_2 := (\sum_{1 \le j \le p}|v_j|^2)^{1/2}, and
  • l_{\infty}-norm \|v\|_{\infty} := \max_{1 \le j \le p}|v_j|.
The transpose of a matrix A \in \mathbb{R}^{n \times p} is denoted A^T. The ith row of A is denoted A_i. The jth column of A is the jth row of A^T, namely A^T_j. Finally, define the bracket \begin{equation} \left[a\right]_c^b := \min(\max(a, b), c), \end{equation} which clamps a to the interval [b, c].
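As a quick illustration, this bracket is just a clamp of a to the interval [b, c]; in NumPy it coincides with np.clip (a minimal sketch; the function name bracket is ours):

```python
import numpy as np

def bracket(a, b, c):
    """The bracket [a]_c^b := min(max(a, b), c), i.e. a clamped to [b, c]."""
    return float(np.clip(a, b, c))
```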

General considerations

Consider a Lasso model with response vector y \in \mathbb{R}^n and design matrix X \in \mathbb{R}^{n \times p}. Let \lambda (with 0 < \lambda < \|X^Ty\|_{\infty}) be the regularization parameter. The primal objective to be minimized as a function of the primal variable \beta \in \mathbb{R}^p (\beta is the vector of regression coefficients) is \begin{eqnarray} p_{\lambda}(\beta) := \frac{1}{2}\|X\beta - y\|^2_2 + \lambda\|\beta\|_1 \label{eq:primal} \end{eqnarray} In the sense of Fenchel-Rockafellar duality, the dual objective to be maximized as a function of the dual variable \theta \in \mathbb{R}^n is \begin{eqnarray} d_{\lambda}(\theta) := \begin{cases} \frac{1}{2}\|y\|^2_2 - \frac{\lambda^2}{2}\|\theta - \frac{y}{\lambda}\|^2_2, &\mbox{if } \|X^T\theta\|_{\infty} \le 1\\ -\infty, &\mbox{otherwise}. \end{cases} \label{eq:dual} \end{eqnarray} Finally, the duality gap \delta_{\lambda}(\beta, \theta) at (\beta, \theta) is defined by \begin{eqnarray} \delta_{\lambda}(\beta, \theta) := p_{\lambda}(\beta) - d_{\lambda}(\theta). \label{eq:dgap} \end{eqnarray} One notes the following straightforward facts:
  • \delta_{\lambda}(\beta, \theta) \ge 0 with equality iff \beta minimizes \eqref{eq:primal} and \theta maximizes \eqref{eq:dual}. Such a primal-dual pair is called an optimal pair.


  • The dual objective d_{\lambda} defined in \eqref{eq:dual} has a unique maximizer \theta_{\lambda}^*, which corresponds to the Euclidean projection of y/\lambda onto the dual-feasible polyhedron \begin{eqnarray} \mathcal{P} := \{\theta \in \mathbb{R}^n | \|X^T\theta\|_{\infty} \le 1\}. \end{eqnarray} Note that \mathcal{P} is closed and convex.
  • Given an optimal primal-dual pair (\beta^*_{\lambda}, \theta^*_{\lambda}) \in \mathbb{R}^p \times \mathcal{P}, we have the fundamental safe screening rule (safe meaning a rule which provably can't mistakenly discard active features): \begin{eqnarray} |X^T_j\theta^*_{\lambda}| < 1 \implies \beta_{\lambda,j}^* = 0\text{ (i.e. the $j$th feature is irrelevant)}. \label{eq:fundamental} \end{eqnarray}
  • The implication \eqref{eq:fundamental} suggests constructing safe screening rules as follows:
    • Construct a ``small'' compact set C \subseteq \mathbb{R}^n containing the dual optimal point \theta_{\lambda}^*, such that the maximum value m_{C,j} := \underset{\theta \in C}{\max}\text{ }|X^T_j\theta| can be easily computed.
    • Noting that m_{C,j} is an upper bound for |X^T_j\theta^*_{\lambda}| in \eqref{eq:fundamental}, we may then discard all features j for which m_{C,j} < 1.
    The rest of this post overviews methods for effectively realizing such a construction.

    Safe sphere tests

    We start with the following lemma which provides a useful formula for the duality-gap \delta_{\lambda}(\beta, \theta) defined in \eqref{eq:dgap}.
    For every (\beta, \theta) \in \mathbb{R}^p \times \mathbb{R}^n, we have \begin{eqnarray} \delta_{\lambda}(\beta, \theta) = \begin{cases} \frac{\lambda^2}{2}\left\|\theta - (y-X\beta)/\lambda\right\|^2_2 + \lambda\left(\|\beta\|_1 - \theta^TX\beta\right), &\mbox{if } \theta \in \mathcal{P}\\ +\infty, &\mbox{otherwise}. \end{cases} \label{eq:dgap_formula} \end{eqnarray}
    Expand the formula in \eqref{eq:dgap}, and then complete the square w.r.t \theta.
    Let \beta \in \mathbb{R}^p be a primal point and let \theta_{\lambda}^* be the optimal dual point (i.e. the maximizer of the dual objective d_{\lambda} defined in \eqref{eq:dual}). Then for every dual point \theta \in \mathbb{R}^n, it holds that \begin{eqnarray} \left\|\theta_{\lambda}^* - (y-X\beta)/\lambda\right\|_2 \le \sqrt{2\delta_{\lambda}(\beta, \theta)}/\lambda. \label{eq:dual_sphere} \end{eqnarray}
    By the optimality of \theta_{\lambda}^* for the ``marginal'' duality-gap function \eta \mapsto \delta_{\lambda}(\beta, \eta), we have \begin{eqnarray} \delta_{\lambda}(\beta, \theta_{\lambda}^*) \le \delta_{\lambda}(\beta, \theta). \label{eq:dgap_ineq} \end{eqnarray} Now observe that \theta_{\lambda}^*, being the projection of the point y/\lambda onto the dual-feasible polyhedron \mathcal{P}, lies on the boundary of \mathcal{P}, because the latter doesn't contain y/\lambda since \lambda < \|X^Ty\|_{\infty}. Thus \|X^T\theta_{\lambda}^*\|_{\infty} = 1, and by Hölder's inequality it follows that \begin{eqnarray} \|\beta\|_1 - \beta^TX^T\theta_{\lambda}^* = \|\beta\|_1\|X^T\theta_{\lambda}^*\|_{\infty} - \beta^TX^T\theta_{\lambda}^* \ge 0. \label{eq:holder} \end{eqnarray} Finally, invoking formula \eqref{eq:dgap_formula} on the LHS of \eqref{eq:dgap_ineq} and using \eqref{eq:holder} completes the proof.
    Given a feasible primal-dual pair (\beta, \theta) \in \mathbb{R}^p \times \mathcal{P}, the above theorem prescribes a trust region for the optimal dual point \theta_{\lambda}^*, namely the sphere S_n\left((y - X\beta)/\lambda, \sqrt{2\delta_{\lambda}(\beta, \theta)}/\lambda\right).
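To make this concrete, here is a minimal NumPy sketch (function names are ours) that evaluates the primal objective, the dual objective, and checks the closed-form duality-gap formula \eqref{eq:dgap_formula} against the plain difference p_{\lambda}(\beta) - d_{\lambda}(\theta) at a hand-built dual-feasible point:

```python
import numpy as np

def primal(X, y, lam, beta):
    """Primal Lasso objective p_lambda(beta)."""
    return 0.5 * np.sum((X.dot(beta) - y) ** 2) + lam * np.abs(beta).sum()

def dual(X, y, lam, theta):
    """Dual objective d_lambda(theta); -inf off the feasible polyhedron P."""
    if np.max(np.abs(X.T.dot(theta))) > 1.0 + 1e-12:
        return -np.inf
    return 0.5 * np.sum(y ** 2) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)

def dgap_formula(X, y, lam, beta, theta):
    """Closed-form duality gap from the lemma, valid for dual-feasible theta."""
    center = (y - X.dot(beta)) / lam
    return (0.5 * lam ** 2 * np.sum((theta - center) ** 2)
            + lam * (np.abs(beta).sum() - theta.dot(X.dot(beta))))

# sanity check on random data, with a feasible theta built by rescaling y / lam
rng = np.random.RandomState(0)
X, y, beta = rng.randn(5, 8), rng.randn(5), rng.randn(8)
lam = 1.0
theta = (y / lam) / max(1.0, np.max(np.abs(X.T.dot(y / lam))))
assert np.isclose(primal(X, y, lam, beta) - dual(X, y, lam, theta),
                  dgap_formula(X, y, lam, beta, theta))
```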

    Static safe sphere test

    Let's begin with the following elementary but important lemma about the maximum value of the function \theta \mapsto |b^T\theta| on the sphere \begin{eqnarray} S_n(c,r) := \{\theta \in \mathbb{R}^n | \|\theta - c\|_2 \le r\}, \end{eqnarray} of center c \in \mathbb{R}^n and radius r > 0. Viz,
    \begin{eqnarray} \underset{\theta \in S_n(c, r)}{max}\text{ }|b^T\theta| = |b^Tc| + r\|b\|_2. \label{eq:max_sphere} \end{eqnarray}
    One has \begin{eqnarray*} \underset{\theta \in S_n(c, r)}{max}\text{ }b^T\theta = \underset{\theta \in S_n(0, r)}{max}\text{ }b^T(\theta + c) = b^Tc + r\underset{\theta \in S_n(0, 1)}{max}\text{ }b^T\theta = b^Tc + r\|b\|_2. \end{eqnarray*} Replacing b with -b in the above equation yields \begin{eqnarray*} \underset{\theta \in S_n(c, r)}{max}-b^T\theta = -b^Tc + r\|b\|_2. \end{eqnarray*} Now, combining both equations and using the fact that \begin{eqnarray*} |b^T\theta| \equiv max(b^T\theta, -b^T\theta), \end{eqnarray*} we obtain the desired result.
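Numerically, \eqref{eq:max_sphere} can be checked by noting that the maximum is attained at \theta = c + \mathrm{sign}(b^Tc)\,r\,b/\|b\|_2 (a small NumPy sketch; the variable names are ours):

```python
import numpy as np

rng = np.random.RandomState(42)
b, c, r = rng.randn(6), rng.randn(6), 0.7

closed_form = abs(b.dot(c)) + r * np.linalg.norm(b)

# maximizer of |b^T theta| over the sphere S_n(c, r)
theta_star = c + np.sign(b.dot(c)) * r * b / np.linalg.norm(b)
assert np.isclose(abs(b.dot(theta_star)), closed_form)

# no random point of the sphere beats the closed form
u = rng.randn(100, 6)
u /= np.linalg.norm(u, axis=1, keepdims=True)
assert np.all(np.abs((c + r * u).dot(b)) <= closed_form + 1e-9)
```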
    The following lemma is from [Xiang et al. 2014].
    Given a safe sphere S_n(c, r) for the optimal dual point \theta_{\lambda}^*, the following screening rule is safe \begin{eqnarray} \text{Discard the } j\text{th feature if } |X_j^Tc| + r\|X_j^T\|_2 < 1. \label{eq:sphere_test} \end{eqnarray}
    Direct application of \eqref{eq:fundamental} and \eqref{eq:max_sphere}.
    Using the trust-region \eqref{eq:dual_sphere} established in the previous theorem for the optimal dual variable \theta_{\lambda}^*, namely the sphere S_n\left((y-X\beta)/\lambda, \sqrt{2\delta_{\lambda}(\beta,\theta)}/\lambda\right), we can envisage devising a screening rule of the form \eqref{eq:sphere_test}. Indeed,
    For any primal-dual pair (\beta,\theta) \in \mathbb{R}^p \times \mathbb{R}^n, the rule \begin{eqnarray} \text{Discard the } j\text{th feature if } \left|X_j^T(y-X\beta)\right| + \sqrt{2\delta_{\lambda}(\beta,\theta)}\|X_j^T\|_2 < \lambda \label{eq:sphere_test_actual} \end{eqnarray} is safe.
    Apply the previous lemma to the safe sphere S_n\left((y-X\beta)/\lambda, \sqrt{2\delta_{\lambda}(\beta,\theta)}/\lambda\right) obtained in the previous theorem.
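In NumPy, the rule \eqref{eq:sphere_test_actual} amounts to one vectorized line (a sketch; the function name sphere_screen is ours). It returns a boolean mask of the features that survive:

```python
import numpy as np

def sphere_screen(X, y, lam, beta, dgap):
    """Safe sphere test: keep feature j iff
    |X_j^T (y - X beta)| + sqrt(2 * dgap) * ||X_j||_2 >= lam."""
    resid = y - X.dot(beta)
    scores = np.abs(X.T.dot(resid)) + np.sqrt(2.0 * dgap) * np.linalg.norm(X, axis=0)
    return scores >= lam
```

Features with mask value False can be dropped from X (and from beta) without changing the solution.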

    Dynamic safe sphere test

    The results presented here are due to very recent work by Gramfort and co-workers. The screening rule \eqref{eq:sphere_test_actual} makes sense (i.e. there is any hope it could ever screen some features) only if \delta_{\lambda}(\beta,\theta) < +\infty, i.e. only if \theta is dual-feasible. In fact, the smaller the duality gap \delta_{\lambda}(\beta,\theta), the more effective the screening rule. Thus we need a procedure which, given a primal point \beta \in \mathbb{R}^p, generates a dual-feasible point \theta for which \delta_{\lambda}(\beta,\theta) is as small as possible. As was first noted in [Bonnefoy et al. 2014], any iterative solver for \eqref{eq:primal} can be used to produce a sequence of primal-dual feasible pairs (\beta^{(k)}, \theta^{(k)}) \in \mathbb{R}^p \times \mathcal{P}, and a shrinking sequence of safe spheres S_n\left(y / \lambda, \|\theta^{(k)} - y / \lambda\|_2\right). Indeed, for each primal iterate \beta^{(k)}, one finds a scalar \mu^{(k)} \in \mathbb{R} such that \mu^{(k)}(y - X\beta^{(k)}) \in \mathcal{P} is the dual-feasible point of this form closest to y / \lambda. This sub-problem is the minimization of the quadratic function \mu \mapsto \|\mu(y-X\beta^{(k)}) - y / \lambda\|^2_2 on the closed interval \left[-\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}, \frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}\right], with the analytic solution \begin{eqnarray} \mu^{(k)} = \begin{cases} \left[\frac{y^T(y-X\beta^{(k)})}{\lambda \|y-X\beta^{(k)}\|_2^2}\right]^{-\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}}_{\frac{1}{\|X^T(y - X\beta^{(k)})\|_{\infty}}}, &\mbox{if } X\beta^{(k)} \ne y\\ 1, &\mbox{otherwise}. \end{cases} \end{eqnarray} The resulting algorithm (``poor man's FISTA with dynamic screening'') is given below.
    • Input: \lambda \in \text{ }]0, \|X^Ty\|_{\infty}[, the regularization parameter; \epsilon > 0, the desired precision on the duality gap; L \ge \|X\|_2^2, a Lipschitz constant for the gradient of the smooth part of \eqref{eq:primal}.
    • Initialize: \beta^{(0)} \leftarrow 0 \in \mathbb{R}^p, \eta^{(0)} \leftarrow \beta^{(0)}, \theta^{(0)} \leftarrow 0 \in \mathcal{P}, \delta^{(0)} \leftarrow +\infty, t^{(0)} \leftarrow 1, and k \leftarrow 0.
    • Repeat (until \delta^{(k)} < \epsilon):
      • \beta^{(k + 1)} \leftarrow soft_{\lambda/L}\left(\eta^{(k)} - \frac{1}{L}X^T(X\eta^{(k)} - y)\right), \hspace{.5em}\theta^{(k+1)} \leftarrow \mu^{(k+1)}(y - X\beta^{(k+1)})
      • t^{(k+1)} \leftarrow \frac{1 + \sqrt{4(t^{(k)})^2 + 1}}{2}, \hspace{.5em}\eta^{(k+1)} \leftarrow \beta^{(k+1)} + \frac{t^{(k)} - 1}{t^{(k+1)}}(\beta^{(k+1)} - \beta^{(k)})
      • \delta^{(k+1)} \leftarrow \frac{\lambda^2}{2}\left\|\theta^{(k+1)} - (y-X\beta^{(k+1)})/\lambda\right\|^2_2 + \lambda\left(\|\beta^{(k+1)}\|_1 - (\theta^{(k+1)})^TX\beta^{(k + 1)}\right)
      • X, \beta^{(k+1)}, \eta^{(k+1)} \leftarrow screen(X,y,\beta^{(k+1)}, \eta^{(k+1)}, \delta^{(k+1)})
      • k \leftarrow k + 1
    • Return \beta^{(k)}
    • One can show that the above dynamic screening procedure does no harm to the convergence theory of the iterative solver.
    • In practice, a coordinate-descent solver would be more appropriate; we presented FISTA here for simplicity rather than applicability.
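Putting the pieces together, here is a minimal NumPy sketch of the algorithm above (function names and the small numerical guards are ours; as just noted, a coordinate-descent solver would be preferable in practice):

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista_lasso_screened(X, y, lam, tol=1e-8, max_iter=10000):
    """Poor man's FISTA with dynamic (duality-gap-based) screening."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2   # Lipschitz constant; still valid after
    support = np.arange(p)          # columns are removed (norm can only shrink)
    beta = np.zeros(p)
    eta, t = beta.copy(), 1.0
    for _ in range(max_iter):
        if X.shape[1] == 0:         # everything was screened out
            break
        beta_new = soft(eta - X.T.dot(X.dot(eta) - y) / L, lam / L)
        resid = y - X.dot(beta_new)
        # dual-feasible point theta = mu * resid, with mu clamped as in the text
        sq = np.sum(resid ** 2)
        if sq > 0:
            bound = 1.0 / max(np.max(np.abs(X.T.dot(resid))), 1e-12)
            mu = np.clip(y.dot(resid) / (lam * sq), -bound, bound)
        else:
            mu = 1.0
        theta = mu * resid
        # duality gap via the closed-form formula
        gap = (0.5 * lam ** 2 * np.sum((theta - resid / lam) ** 2)
               + lam * (np.abs(beta_new).sum() - theta.dot(X.dot(beta_new))))
        gap = max(gap, 0.0)         # guard against tiny negative rounding
        # FISTA momentum
        t_new = 0.5 * (1.0 + np.sqrt(4.0 * t ** 2 + 1.0))
        eta = beta_new + (t - 1.0) / t_new * (beta_new - beta)
        beta, t = beta_new, t_new
        if gap < tol:
            break
        # dynamic screening: drop features failing the safe sphere test
        keep = (np.abs(X.T.dot(resid))
                + np.sqrt(2.0 * gap) * np.linalg.norm(X, axis=0) >= lam)
        X, beta, eta, support = X[:, keep], beta[keep], eta[keep], support[keep]
    out = np.zeros(p)
    out[support] = beta
    return out
```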

    Univariate (heuristic) screening for brain decoding problems

    In [Dohmatob et al. PRNI2015], we proposed (amongst other tricks) a univariate heuristic for detecting and disregarding irrelevant voxels in brain decoding problems (see figure above). This heuristic can result in up to a 10-fold speedup over full-brain analysis. Get a quick overview here.
