Since $\log n > 2$ for any $n > 7$, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than $C_p$. \[ BIC = \frac{1}{n}(RSS + \log(n) d \hat{\sigma}^2). \] \medskip {\bf Adjusted $R^2$}: \[ \text{Adjusted } R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}, \] where TSS is the total sum of squares. Unlike $C_p$, AIC, and BIC, a {\it large} value of adjusted $R^2$ indicates a model with a small test error. Maximizing the adjusted $R^2$ is equivalent to minimizing $\frac{RSS}{n-d-1}$. \bigskip \noindent {\bf Direct Method: Validation and Cross-Validation} \medskip These procedures have an advantage over AIC, BIC, $C_p$, and adjusted $R^2$, in that they provide a direct estimate of the test error and {\it do not require an estimate of the error variance $\sigma^2$}. They can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom or hard to estimate the error variance $\sigma^2$. In this setting, we can select a model using the {\it one-standard-error rule}: we first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. \subsection{Subset Selection} {\bf Best Subset Selection}: besides linear regression, the same ideas also apply to other types of models, such as logistic regression. 1. Let ${\cal M}_0$ denote the {\it null model}, which contains no predictors. This model simply predicts the sample mean for each observation. 2. For $k=1, 2, \cdots, p$: \hspace{0.1in} (a) Fit all $\left(\begin{matrix} p \\ k \end{matrix}\right)$ models that contain exactly $k$ predictors. \hspace{0.1in} (b) Pick the best among these $\left(\begin{matrix} p \\ k \end{matrix}\right)$ models, and call it ${\cal M}_k$. Here {\it best} is defined as having the smallest $RSS$, or equivalently the largest $R^2$. 3. 
Select a single best model from among ${\cal M}_0$, $\cdots$, ${\cal M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$. For computational reasons, best subset selection cannot be applied with very large $p$ (there are $2^p$ subsets to screen). Best subset selection may also suffer from {\it overfitting} and high variance of the coefficient estimates when the search space is large. For both of these reasons, {\it stepwise} methods are attractive alternatives. \medskip {\bf Forward Stepwise Selection}: 1. Let ${\cal M}_0$ denote the {\it null} model, which contains no predictors. 2. For $k=0, \cdots, p-1$: \hspace{0.1in} (a) Consider all $p-k$ models that augment the predictors in ${\cal M}_k$ with one additional predictor. \hspace{0.1in} (b) Choose the {\it best} among these $p-k$ models, and call it ${\cal M}_{k+1}$. Here {\it best} is defined as having the smallest RSS or the highest $R^2$. 3. Select a single best model from among ${\cal M}_0$, $\cdots$, ${\cal M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$. \medskip {\bf Backward Stepwise Selection}: unlike forward stepwise selection, backward stepwise selection requires that the number of samples $n$ be larger than the number of variables $p$ (so that the full model can be fit). 1. Let ${\cal M}_p$ denote the {\it full} model, which contains all $p$ predictors. 2. For $k= p, p-1, \cdots, 1$: \hspace{0.1in} (a) Consider all $k$ models that contain all but one of the predictors in ${\cal M}_k$, for a total of $k-1$ predictors. \hspace{0.1in} (b) Choose the {\it best} among these $k$ models, and call it ${\cal M}_{k-1}$. Here {\it best} is defined as having the smallest RSS or the highest $R^2$. 3. Select a single best model from among ${\cal M}_0, \cdots, {\cal M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$. \subsection{Shrinkage} {\bf Ridge Regression}. 
Recall that the least squares fitting procedure estimates $\beta_0$, $\beta_1$, $\cdots$, $\beta_p$ using the values that minimize \[ RSS = \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2. \] In contrast, the ridge regression coefficient estimates $\hat{\beta}^R$ are the values that minimize \[ \sum_{i=1}^n \left(y_i -\beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 = RSS + \lambda \sum_{j=1}^p \beta_j^2, \] where $\lambda \ge 0$ is a {\it tuning parameter}, to be determined separately. The intuition is that the second term, $\lambda\sum_j \beta_j^2$, called a {\it shrinkage penalty}, is small when $\beta_1$, $\cdots$, $\beta_p$ are close to zero, and so it has the effect of {\it shrinking} the estimates of $\beta_j$ towards zero. The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates. Selecting a good value for $\lambda$ is critical; cross-validation is used for this. The standard least squares coefficient estimates are {\it scale equivariant}. In other words, regardless of how the $j$th predictor is scaled, $X_j\hat{\beta}_j$ will remain the same. In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function. Therefore, it is best to apply ridge regression after {\it standardizing the predictors}, using the formula \[ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}}. \] \medskip {\bf Lasso}. Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all $p$ predictors in the final model. The {\it Lasso} is a relatively recent alternative to ridge regression that overcomes this disadvantage. 
The lasso coefficients, $\hat{\beta}_{\lambda}^L$, minimize the quantity \[ \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^p |\beta_j| = RSS + \lambda \sum_{j=1}^p |\beta_j|. \] As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the $l_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large. Hence, much like best subset selection, the lasso performs {\it variable selection}. As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice. \medskip {\it Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?} One can show that the lasso and ridge regression coefficient estimates solve the problems \[ \min_{\beta} \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \text{ subject to } \sum_{j=1}^p |\beta_j| \le s \] and \[ \min_{\beta} \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \text{ subject to } \sum_{j=1}^p \beta_j^2 \le s \] respectively. Geometrically, the lasso constraint region $\{(\beta_1, \cdots, \beta_p): \sum_{j=1}^p |\beta_j| \le s \}$ is a polytope whose corners lie on the coordinate axes, where some coefficients are exactly zero; the elliptical contours of the RSS often first touch the region at such a corner. The ridge constraint region $\{(\beta_1, \cdots, \beta_p): \sum_{j=1}^p \beta_j^2 \le s \}$ is a sphere with no corners, so the point of first contact rarely lies on an axis. \medskip {\bf Conclusion}. Neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors. However, the number of predictors related to the response is never known {\it a priori} for real data sets. A technique such as cross-validation can be used to determine which approach is better on a particular data set. 
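\medskip The contrast can also be checked directly in the simplest special case: a single centered predictor and no intercept. Setting the derivative of each objective to zero gives $\hat{\beta}^R_\lambda = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$ for ridge, while the lasso solution soft-thresholds and equals exactly zero whenever $\lambda \ge 2|\sum_i x_i y_i|$. The following pure-Python sketch (made-up data; not code from the text) illustrates this:

```python
# One-predictor, no-intercept special case of ridge and lasso (made-up data).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]          # centered predictor
ys = [-3.9, -2.1, 0.1, 1.8, 4.1]          # response
sxy = sum(x * y for x, y in zip(xs, ys))  # sum of x_i * y_i
sxx = sum(x * x for x in xs)              # sum of x_i^2

def ridge_1d(lam):
    """Minimizer of sum (y_i - b x_i)^2 + lam * b^2."""
    return sxy / (sxx + lam)

def lasso_1d(lam):
    """Minimizer of sum (y_i - b x_i)^2 + lam * |b| (soft-thresholding)."""
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * max(abs(sxy) - lam / 2.0, 0.0) / sxx

# Ridge shrinks smoothly but never reaches zero; the lasso hits exactly zero.
print(ridge_1d(0.0), lasso_1d(0.0))    # identical least-squares fit at lambda = 0
print(ridge_1d(50.0), lasso_1d(50.0))  # ridge is still positive, lasso is exactly 0.0
```

At $\lambda = 0$ both reduce to least squares; as $\lambda$ grows, the ridge estimate only approaches zero asymptotically, while the lasso estimate is set exactly to zero once the penalty dominates, which is the one-dimensional version of the corner argument.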
To select the tuning parameter $\lambda$ for ridge regression and the lasso (or, equivalently, the value of the constraint $s$), we can use cross-validation. We choose a grid of $\lambda$ values, and compute the cross-validation error for each value of $\lambda$. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter. \subsection{Dimension Reduction} Let $Z_1$, $Z_2$, $\cdots$, $Z_M$ represent $M
p$ dimensional space. This fits a support-vector classifier in the enlarged space and results in non-linear decision boundaries in the original space. \bigskip {\bf Support vector machines}. Polynomials (especially high-dimensional ones) get wild rather fast. There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers -- through the use of kernels. It can be shown that $\bullet$ The linear support vector classifier can be represented as \[ f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x, x_i\rangle, \] where there are $n$ parameters $\alpha_i$, $i=1, \cdots, n$, one per training observation. $\bullet$ To estimate the parameters $\alpha_1$, $\cdots$, $\alpha_n$ and $\beta_0$, all we need are the $\left( \begin{matrix} n \\ 2 \end{matrix} \right)$ inner products $\langle x_i, x_{i'} \rangle$ between all pairs of training observations. It turns out that most of the $\hat{\alpha}_i$ can be zero: \[ f(x) = \beta_0 + \sum_{i \in {\cal S}} \hat{\alpha}_i \langle x, x_i \rangle, \] where ${\cal S}$ is the {\it support set} of indices $i$ such that $\hat{\alpha}_i > 0$. If we can compute inner products between observations, we can fit an SV classifier. Some special kernel functions can do this for us. E.g. \[ K(x_i, x_{i'}) = \left( 1 + \sum_{j=1}^p x_{ij} x_{i'j} \right)^d \] computes the inner products needed for degree-$d$ polynomials -- $\left(\begin{matrix} p+d \\ d\end{matrix}\right)$ basis functions. The solution has the form \[ f(x) = \beta_0 + \sum_{i \in {\cal S}} \hat{\alpha}_i K(x, x_i). \] \bigskip {\bf SVMs with more than two classes}. The SVM as defined works for $K = 2$ classes. What do we do if we have $K > 2$ classes? $\bullet$ One versus All (OVA). Fit $K$ different 2-class SVM classifiers $\hat{f}_k(x)$, $k = 1, \cdots, K$; each class versus the rest. Classify $x^*$ to the class for which $\hat{f}_k(x^*)$ is largest. $\bullet$ One versus One (OVO). 
Fit all $\left(\begin{matrix} K \\ 2 \end{matrix}\right)$ pairwise classifiers $\hat{f}_{kl}(x)$. Classify $x^*$ to the class that wins the most pairwise competitions. Which to choose? If $K$ is not too large, use OVO. \bigskip {\bf Relationship to logistic regression}. With $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, we can rephrase the support-vector classifier optimization as \[ \min_{\beta_0, \beta_1, \cdots, \beta_p} \left\{\sum_{i=1}^n \max [0, 1-y_i f(x_i)] + \lambda \sum_{j=1}^p \beta_j^2 \right\}. \] This has the form of {\it loss plus penalty}. The loss is known as the {\it hinge loss}, and is very similar to the ``loss'' in logistic regression (the negative log-likelihood). Which to use: SVM or logistic regression? $\bullet$ When the classes are (nearly) separable, SVM does better than LR. So does LDA. $\bullet$ When they are not, LR (with a ridge penalty) and SVM are very similar. $\bullet$ If you wish to estimate probabilities, LR is the choice. $\bullet$ For nonlinear boundaries, kernel SVMs are popular. Kernels can be used with LR and LDA as well, but the computations are more expensive. \section{Unsupervised Learning} In {\it supervised learning} methods such as regression and classification, we observe both a set of features $X_1$, $X_2$, $\cdots$, $X_p$ for each object and a response or outcome variable $Y$. The goal is then to predict $Y$ using $X_1$, $X_2$, $\cdots$, $X_p$. In {\it unsupervised learning}, we observe only the features $X_1$, $X_2$, $\cdots$, $X_p$. We are not interested in prediction, because we do not have an associated response variable $Y$. The goal is to discover interesting things about the measurements: is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations? 
We discuss two methods: $\bullet$ {\it principal components analysis}, a tool used for data visualization or data pre-processing before supervised techniques are applied, and $\bullet$ {\it clustering}, a broad class of methods for discovering unknown subgroups in data. \bigskip {\bf Principal components analysis} (PCA). PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated. Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization. $\bullet$ The first principal component of a set of features $X_1$, $X_2$, $\cdots$, $X_p$ is the normalized linear combination of the features \[ Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p \] that has the largest variance. By {\it normalized}, we mean that $\sum_{j=1}^p \phi^2_{j1}=1$. We refer to the elements $\phi_{11}$, $\phi_{21}$, $\cdots$, $\phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector, $\phi_1 = (\phi_{11}, \phi_{21}, \cdots, \phi_{p1})^T$. To compute principal components, suppose we have an $n\times p$ data set ${\bf X}$. We assume that each of the variables in ${\bf X}$ has been centered to have mean zero. We then look for the linear combination of the sample feature values of the form \[ z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip} \] for $i = 1, \cdots, n$ that has the largest sample variance, subject to the constraint that $\sum_{j=1}^p \phi^2_{j1} = 1$. Since each of the $x_{ij}$ has mean zero, so does $z_{i1}$ (for any values of $\phi_{j1}$). Hence the sample variance of the $z_{i1}$ can be written as $\frac{1}{n}\sum_{i=1}^n z^2_{i1}$. 
In other words, the first principal component loading vector solves the optimization problem \[ \max_{\phi_{11}, \cdots, \phi_{p1}} \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j1} x_{ij} \right)^2 \text{ subject to } \sum_{j=1}^p \phi^2_{j1} = 1. \] The problem can be solved via a singular value decomposition of the matrix ${\bf X}$. We refer to $Z_1$ as the first principal component, with realized values $z_{11}$, $\cdots$, $z_{n1}$. The loading vector $\phi_1$ with elements $\phi_{11}$, $\phi_{21}$, $\cdots$, $\phi_{p1}$ defines a direction in feature space along which the data vary the most. If we project the $n$ data points $x_1$, $\cdots$, $x_n$ onto this direction, the projected values are the principal component scores $z_{11}$, $\cdots$, $z_{n1}$ themselves. $\bullet$ The second principal component is the linear combination of $X_1$, $\cdots$, $X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$. The second principal component scores $z_{12}$, $z_{22}$, $\cdots$, $z_{n2}$ take the form \[ z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \cdots + \phi_{p2} x_{ip}, \] where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}$, $\phi_{22}$, $\cdots$, $\phi_{p2}$. $\bullet$ It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$. And so on. $\bullet$ The principal component directions $\phi_1$, $\phi_2$, $\phi_3$, $\cdots$ are the ordered sequence of right singular vectors of the matrix ${\bf X}$, and the variances of the components are $\frac{1}{n}$ times the squares of the singular values. There are at most $\min(n-1, p)$ principal components. \bigskip {\bf Clustering methods}. Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. 
We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other. To make this concrete, we must define what it means for two or more observations to be similar or different. Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied. We discuss two clustering methods: $\bullet$ In {\it K-means clustering}, we seek to partition the observations into a pre-specified number of clusters. $\bullet$ In {\it hierarchical clustering}, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a {\it dendrogram}, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to $n$. \medskip {\it K-means clustering}. The idea behind $K$-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible. The within-cluster variation for cluster $C_k$ is a measure $WCV(C_k)$ of the amount by which the observations within a cluster differ from each other, where \[ WCV(C_k) = \frac{1}{|C_k|} \sum_{i, i'\in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 \] with $|C_k|$ denoting the number of observations in the $k$th cluster. Hence we want to solve the problem \[ \min_{C_1, \cdots, C_K} \left\{\sum_{k=1}^K WCV(C_k) \right\}. \] The $K$-means clustering algorithm is: $\bullet$ 1. Randomly assign a number, from 1 to $K$, to each of the observations. These serve as initial cluster assignments for the observations. $\bullet$ 2. Iterate until the cluster assignments stop changing: \hspace{.2in} 2.1 For each of the $K$ clusters, compute the cluster centroid. The $k$th cluster centroid is the vector of the $p$ feature means for the observations in the $k$th cluster. \hspace{.2in } 2.2 Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance). 
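\medskip The algorithm above can be sketched in plain Python. This is a minimal illustration on made-up two-dimensional data; the function name {\tt kmeans} and the fallback for empty clusters are choices of this sketch, not part of the text:

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain-Python sketch of the K-means clustering algorithm."""
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2.1: the j-th centroid is the vector of feature means
        # for the observations currently assigned to cluster j.
        centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:  # fallback: re-seed an empty cluster at a random point
                members = [rng.choice(points)]
            centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        # Step 2.2: reassign each observation to the nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
    return labels, centroids

# Two well-separated groups of made-up points:
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
labels, centroids = kmeans(pts, 2)
```

On this toy data the three points near the origin should end up in one cluster and the three near $(5, 5)$ in the other, because each reassignment step can only decrease the within-cluster variation.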
This algorithm is guaranteed to decrease the value of the objective at each step, because \[ \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 = 2 \sum_{i\in C_k} \sum_{j=1}^p (x_{ij} - \bar{x}_{kj})^2, \] where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature $j$ in cluster $C_k$. \medskip {\it Hierarchical clustering}. $K$-means clustering requires us to pre-specify the number of clusters $K$. This can be a disadvantage. Hierarchical clustering is an alternative approach that does not require us to commit to a particular choice of $K$. In this section, we describe {\it bottom-up} or {\it agglomerative} clustering. This is the most common type of hierarchical clustering, and refers to the fact that a dendrogram is built starting from the leaves and combining clusters up to the trunk. The hierarchical clustering algorithm in words: $\bullet$ Start with each point in its own cluster. $\bullet$ Identify the closest two clusters and merge them. $\bullet$ Repeat. $\bullet$ Stop when all points are in a single cluster. \begin{thebibliography}{99} \bibitem{Dataschool14} Data School. ``In-depth introduction to machine learning in 15 hours of expert videos''. \href{http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/}{http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/}, September 3, 2014. \bibitem{JWHT14} Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. {\it An Introduction to Statistical Learning: with Applications in R}. New York: Springer, 2014. \bibitem{Silver15} Nate Silver. {\it The Signal and the Noise: Why So Many Predictions Fail--but Some Don't}. Penguin Books, 2015. \bibitem{Zeng15} Yan Zeng. ``Book summary: {\it Econometrics for Dummies}''. \href{http://opensrcsolutions.webs.com/}{http://opensrcsolutions.webs.com/}, 2015. \end{thebibliography} \end{document} % Version 1.0, 2016-5-14: first draft.