\documentclass[12pt]{article}

\usepackage{ amsmath, amssymb, graphicx, psfrag, bm }

\addtolength{\textheight}{2.1in}
\addtolength{\topmargin}{-1.25in}
\addtolength{\textwidth}{1.965in}
\addtolength{\oddsidemargin}{-1.0in}
\addtolength{\evensidemargin}{-1.1in}
\setlength{\parskip}{0.1in}
\setlength{\parindent}{0.0in}

\pagestyle{plain}

\raggedbottom
 
\newcommand{\given}{\, | \,}
 
\begin{document}

\begin{flushleft}

Prof.~David Draper \\
Department of Statistics \\
Baskin School of Engineering \\
University of California, Santa Cruz

\end{flushleft}

\Large

\begin{center}

STAT 206 (\textsf{Applied Bayesian Statistics})

\fbox{\textbf{Take-Home Test 3}}

\large

\textbf{Absolute due date}: uploaded to \texttt{canvas.ucsc.edu} by 11.59pm on \textbf{21 Mar 2021}

\end{center}

\normalsize

Here are the ground rules: this test is open-book and open-notes, and
consists of two problems (true/false and calculation). \textbf{Each of the 7
true/false questions is worth 10 points; calculation problem 2(A) is
worth 115 total points (with the possibility of up to 20 extra credit points); 
problem 2(B) is worth 220 points; the total for the test is therefore 405 points, plus the possibility of up to 20 extra-credit points in problem 2(A)}.

The right answer with no reasoning to support it, or incorrect reasoning,
will get \textbf{half credit}, so try to make a serious effort on each part
of each problem (this will ensure you at least half credit). In an AMS graduate class I taught in 2012, on a take-home test like this one there were 15 true/false questions, worth a total of 150 points; one student got a score of 92 out of 150 (61\%, a D$-$, in a graduate class where B$-$ is the lowest passing grade) on that part of the test, for repeatedly answering just ``true" or ``false" with no explanation. Don't let that happen to you.  

On non-extra-credit problems, I mentally start everybody out at $-0$ (i.e.,
with a perfect score), and then you accumulate negative points for
incorrect answers and/or reasoning, or parts of problems left blank. On
extra-credit problems, the usual outcome is that you go forward (in the
sense that your overall score goes up) or you at least stay level, but
please note that it's also possible to go backwards on such problems (e.g.,
if you accumulate $+3$ for part of an extra-credit problem but $-4$ for the
rest of it, for saying or doing something egregiously wrong).

This test is to be entirely your own efforts; do not collaborate with
anyone or get help from anyone but me or our TAs (Peter Trubey and Xingchen (Joe) Yu). The intent is that the course lecture notes and readings should be sufficient to provide you with all the guidance you need to solve the problems posed below, but you may use other written materials (e.g., the web, journal
articles, and books other than those already mentioned in the readings),
\textbf{provided that you cite your sources thoroughly and accurately}; you
will lose (substantial) credit for, e.g., lifting blocks of text directly
from \texttt{wikipedia} and inserting them into your solutions without full
attribution.

If it's clear that (for example) two people have worked together on a part
of a problem that's worth 20 points, and each answer would have earned 16
points if it had not arisen from a collaboration, then each person will
receive 8 of the 16 points collectively earned (for a total score of 8 out
of 20), and I reserve the right to impose additional penalties at my
discretion. If you solve a problem on your own and then share your solution
with anyone else (because people from your cultural background routinely do
this, or out of pity, or kindness, or whatever motive you may believe you
have; it doesn't matter), you're just as guilty of illegal collaboration as
the person who took your solution from you, and both of you will receive
the same penalty. This sort of thing is necessary on behalf of the many
people who do not cheat, to ensure that their scores are meaningfully
earned. In the AMS graduate class in 2012 mentioned above, five people failed the class because of illegal collaboration; don't let that happen to you.

In class I've demonstrated numerical work in \texttt{R}; you can (of course) make the calculations and plots requested in the problems below in any environment you prefer (e.g., \texttt{Matlab}, \texttt{Python}, ...).

\textbf{Please collect \{all of the code you used in answering the questions  below\} into an Appendix at the end of your document, so that (if you do something wrong) the grader can better give you part credit.} To avoid plagiarism, if you end up using any of the code I post on the course web page or generate during office hours, at the beginning of your Appendix you can say something like the following:

\begin{quote}

\textit{I used some of Professor Draper's \texttt{R} code in this assignment, adapting it as needed.}

\end{quote}

\section{True/False}

\textit{[70 total points: 10 points each]} For each statement below, say whether it's true or false; if true without further assumptions, briefly explain why it's true (and --- \textit{extra credit} --- what its implications are for statistical inference); if it's sometimes true, give the extra conditions necessary to make it true; if it's false, briefly explain how to change it so that it's true and/or give an example of why it's false. If the statement consists of two or more sub-statements and two or more of them are false, you need to explicitly address all of the false sub-statements in your answer.

In answering these questions you may find it helpful to consult the following references, available on the course web page: DS (Degroot and Schervish (2012)) sections 3.10, 12.5, 12.6; Gelman et al.~(2014) Chapter 11.

\begin{itemize}

\item[(A)]

If You can figure out how to do IID sampling from the posterior distribution of interest to You, this will often be more Monte-Carlo efficient than MCMC sampling from the same posterior.

\item[(B)]

A (first-order) Markov chain is a particularly simple stochastic process: to simulate where the chain goes next, You only need to know (i) where it is now and (ii) where it was one iteration ago.

\item[(C)]

The bootstrap is a frequentist simulation-based computational method (with ties to Bayesian nonparametrics) that can be used to create approximate confidence intervals for population summaries even when the population distribution of the outcome variable $y$ of interest is not known; for example, if (by exchangeability, implied by the problem context) all You know is that Your observations $\bm{ y } = ( y_1, \dots, y_n )$ are IID from \textit{some} distribution with finite mean $\mu$ and finite SD $\sigma$, You can use the bootstrap to build an approximate confidence interval for $\mu$ even though You don't know what the population distribution is.

\item[(D)]

In MCMC sampling from a posterior distribution, You have to be really careful to use a monitoring period of just the right length, because if the monitoring goes on for too long the Markov chain may drift out of equilibrium.

\item[(E)]

Simulation-based computational methods are needed in Bayesian data science (inference, prediction and decision-making) because conjugate priors don't always exist and high-dimensional probability distributions are difficult to summarize algebraically.

\item[(F)]

In MCMC sampling from a posterior distribution, You have to be really careful to use a burn-in period of just the right length, because if the burn-in goes on for too long the Markov chain will have missed its chance to find the equilibrium distribution.

\item[(G)]

You're MCMC sampling from a posterior distribution for a vector $\bm{ \theta } = ( \theta_1, \dots, \theta_k )$. During the monitoring period, the column in the MCMC data set for a component of $\theta$ ($\theta_j$, say) behaves like an autoregressive time series of order 1 ($AR_\rho ( 1 )$) with estimated first-order autocorrelation $\hat{ \rho }_j = 0.992$. As usual, You'll use the sample mean $\bar{ \theta }_j^*$ of the monitored draws $\theta_{ ij }^*$ as Your Monte Carlo estimate of the posterior mean of $\theta_j$. To achieve the same estimated Monte Carlo standard error for $\bar{ \theta }_j^*$ that You would have been able to attain if You could have done IID sampling, Your MCMC monitoring sample size would have to be about 250 times bigger than the length of the IID monitoring run.

\end{itemize}

\section{Calculation}

Parts (A) and (B) of this problem are similar to problem 2(B) in Take-Home Test 2, in that they look really long but don't actually have that much for You to do: just read the problems carefully, run my code (sometimes You'll need to modify it a bit), and figure how to interpret the output.

\begin{itemize}

\item[(A)]

% talk about burleigh's issue of measurement error in the predictors

\textit{[115 total points for this problem, plus up to 20 extra credit points]}  As I'm sure You know, if You encounter a wild mushroom in a forest there's no guarantee that it's edible; every year several people die in the U.S.~from wild mushroom poisoning. Two questions come to mind, in this age of cell phone apps: (1) Can the edible/poisonous status of a wild mushroom be accurately predicted from characteristics such as its appearance and odor? and (2) If You were building an app to give people advice about whether a wild mushroom they've found is edible, (to make the app easy to use) what's the minimum number of variables necessary to get highly accurate predictions?

The \textit{U.C.~Irvine Machine Learning Repository} has a data set -- a copy of which is now on the \texttt{Drupal} course web page, along with a text file containing important contextual information -- consisting of $n =$ 8,124 

\begin{quote}

hypothetical samples corresponding to 23 species of gilled mushrooms in the \textit{Agaricus} and \textit{Lepiota} Family. Each species is identified as
definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The \textit{Audubon Society Field Guide to North American Mushrooms} (1981) clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like ``leaflets three, let it be'' for Poisonous Oak and Ivy.

\end{quote}

As You'll see when You begin looking at the data set, there are $k = 22$ predictor variables $( x_1, \dots, x_k )$ available, ranging from aspects of the mushroom's cap to its habitat, and the outcome variable $y$ is coded 1 for poisonous and 0 for edible. The goals of this problem, corresponding to the two questions above, are (1) to build linear regression models, using these predictors, to produce estimated probabilities $\hat{ p }$ that $( y = 1 )$ as a function of a given mushroom's characteristics, (2) to identify the smallest subset of the $x_j$ (for inclusion in the app) that still produces highly accurate $\hat{ p }$ values, and (3) to decide whether the predictive modeling in step (2) is accurate enough to release the app to the general public without poisoning a lot of people in the process.

We're going to do a maximum-likelihood analysis of this data set, because (I'm assuming that) you and I know so little about `gilled mushrooms in the \textit{Agaricus} and \textit{Lepiota} Family' that a Bayesian analysis here would just reproduce the likelihood story. I'm also doing something a bit unusual in having you fit \textit{linear} regression models to a binary outcome variable, but the more usual choice of \textit{logistic} regression models (like those in problem 2(B)) would give essentially the same results.

When You examine the set of predictor variables, You'll see that they're all categorical (\texttt{R} calls such variables \textit{factors}), taking on a number of possible values (\textit{levels}) ranging from 1 to 12. 

\begin{quote}

\textbf{\fbox{\textit{Important:}} \vspace*{0.025in} all of the levels of all of the predictors in the data set have been abbreviated to a single letter in the standard English alphabet; the context file contains a dictionary that translates those abbreviations to their actual meanings.} 

\end{quote}

Obviously any predictor variable that takes on only 1 possible value is useless for predicting anything; there is such a variable, so early on in the analysis we'll drop it (\texttt{veil.type}) and reset $k$ to 21. One variable -- \texttt{stalk.root} -- has a lot of missing values (2,480 out of 8,124), but one nice thing about categorical predictors is that \textit{missingness can be treated as just another level of the factor}, so that no cases are lost by having to omit rows in the data set with missing values in them (that would be an undesirable action that's not needed with factor predictors).

As discussed in class, the basic frequentist (multiple) linear regression model is of the form (for $i = 1, \dots, n )$
\begin{equation} \label{e:linear-regression-1}
y_i = \beta_0 + \sum_{ j = 1 }^k x_{ ij } \, \beta_j + e_i \, ,
\end{equation}
in which the $( e_i \given \sigma \, [SM \! \! : \! \mathbb{ N }] \, \mathcal{ B } )$ are IID $N ( 0, \sigma^2 )$; here $[SM \! \! : \! \mathbb{ N }]$ denotes the Normality assumption for the sampling model (SM), which is not part of problem context. In class we also saw that this model can be written in matrix form as
\begin{equation} \label{e:linear-regression-2}
\bm{ y } = X \, \bm{ \beta } + \bm{ e } \, ,
\end{equation} 
where $\bm{ y }$ is the $( n \times 1 )$ column vector whose transpose is $( y_1, \dots, y_n )$, $X$ is the $[ n \times ( k + 1 ) ]$ matrix whose first column is a vector of 1s (to account for the intercept term $\beta_0$) and whose $i$th row is $( 1, x_{ i1 }, \dots, x_{ ik } )$, $\bm{ \beta }$ is the $[ ( k + 1 ) \times 1 ]$ column vector whose transpose is $( \beta_0, \beta_1, \dots, \beta_k )$ and $\bm{ e }$ is the $( n \times 1 )$ column vector whose transpose is $( e_1, \dots, e_n )$. 

In applying this model to the mushroom data, a new question immediately arises: how can You bring a categorical predictor -- such as the 18th predictor in the data set \texttt{ring.number}, with the 3 levels ``n'' (none), ``o'' (one) and ``t'' (two) -- into a regression model? The answer is with a set of \textit{indicator}, also known as \textit{dummy}, variables: with $x_{ \{ 18 \} }$ = \texttt{ring.number} as an example (here $x_{ \{ 18 \} }$ means the 18th predictor variable), having $\ell = 3$ levels, You create a new variable $z_1$ that's 1 if $x_{ \{ 18 \} } =$ ``n'' and 0 otherwise, and another new variable $z_2$ that's 1 if $x_{ \{ 18 \} } =$ ``o'' and 0 otherwise, and a third new variable $z_3$ that's 1 if $x_{ \{ 18 \} } =$ ``t'' and 0 otherwise. 

If You now include all $\ell = 3$ of the $z_j$ in the set of predictors, in place of $x_{ \{ 18 \} }$, You will have created what's called a \textit{collinearity} problem: by the nature of how the $z_j$ were defined, for every mushroom $i$ in the data set it's a fact that $z_{ i1 } + z_{ i2 } + z_{ i3 } = 1$. This makes the $X$ matrix in equation (\ref{e:linear-regression-2}) non-invertible, meaning that the computation of the maximum-likelihood estimate of $\bm{ \beta }$, namely $\hat{ \bm{ \beta } } = ( X^T X )^{ -1 } \, X^T \, \bm{ y }$, would be more difficult to carry out. The (simple) solution is to omit one of the $z$ variables in the set of $z_j$ You include in the modeling: after all, in the \texttt{ring.number} example, if You knew $z_{ i1 }$ and $z_{ i2 }$, $z_{ i3 } = ( 1 - z_{ i1 } - z_{ i2 } )$ would be redundant (in the jargon of regression modeling, the category whose $z$ dummy has been left out is called the \textit{omitted group}). Letting $\ell_j$ be the number of levels of categorical predictor $x_j$ and setting $L = \sum_{ j = 1 }^k \ell_j$, the new linear regression model, expressed in terms of the dummy variables $z$, is
\begin{eqnarray} \label{e:linear-regression-3}
y_i & = & \beta_0 + \left[ \beta_1 \, z_{ i1 } + \dots + \beta_{ \ell_1 - 1 } \, z_{ i, \ell_1 - 1 } \right] + \left[ \beta_{ \ell_1 } \, z_{ i2 } + \dots + \beta_{ \ell_1 + \ell_2 - 2 } \, z_{ i, \ell_1 + \ell_2 - 2 } \right] \\ \nonumber 
& & \hspace*{0.25in} + \dots + \left[ \beta_{ L - K - ( \ell_k - 2 ) } \, z_{ i, L - K - ( \ell_k - 2 ) } + \dots + \beta_{ L - k } \, z_{ i, L - k } \right] + e_i \, .
\end{eqnarray}
This looks nasty but isn't: original categorical variable (factor) $x_1$ is replaced by $( \ell_1 - 1 )$ dummy variables, original factor $x_2$ is replaced by $( \ell_2 - 1 )$ dummies, and so on up to original factor $x_k$ being replaced by $( \ell_k - 1 )$ dummies, for a total of $k^* = ( L - k )$ dummy variables replacing the original $k$ factors. In the mushroom data set there are  $k = 21$ non-trivial factors as predictor variables, and the total number of dummies needed to carry this information is $[ ( 6 + 4 + 10 + 2 + 9 + 2 + 2 + 2 + 12 + 2 + 5 + 4 + 4 + 9 + 9 + 4 + 3 + 5 + 9 + 6 + 7 ) - 21 ] = 95$. (Where did I get the numbers $( 6 + 4 + \dots + 7 )$?)

\begin{itemize}

\item[(a)]

\textit{[5 total points for this sub-problem]} Create a new directory for this case study and download into this directory all of the files on the course web page that have `mushroom' in their names. I've written some \texttt{R} code for You, to start You on the analysis of this data set; it's in the file you just downloaded called

\hspace*{1.0in} \texttt{stat-206-mushroom-partial-data-analysis.txt}

There's a block of code at the top of the file that begins \texttt{`the first block of code starts here'} and ends \texttt{`the first block of code ends here'}; run this code block and study the output. The function \texttt{tab.sum} in this code block provides diagnostic information on whether a factor $x$ will turn out to be predictively useful in the modeling; briefly explain in what sense \texttt{tab.sum} provides such information (\textit{Hint:} the function estimates the conditional mean (and SD, not useful here) of what variable given what other variable?) \textit{[5 points]}.

\item[(b)]

\textit{[10 total points for this sub-problem]} Run the second code block, in which a linear regression model is fit with the dichotomous outcome \texttt{poisonous} regressed on the factor \texttt{cap.shape}, and study the output. When the predictions $\hat{ y }$ from equation (\ref{e:linear-regression-3}) are specialized to this regression, they looks like
\begin{equation} \label{e:linear-regression-4}
\hat{ y }_i = \hat{ \beta }_0 + \hat{ \beta }_1 \, z_{ i1 } + \dots + \hat{ \beta }_5 \, z_{ i5 } \, ,
\end{equation} 
in which the $\hat{ \beta }_j$ are the maximum-likelihood estimates of the regression coefficients and where \{$z_{ i1 } =$ 1 if \texttt{cap.shape} = `c' and 0 otherwise\}, \{$z_{ i2 } =$ 1 if \texttt{cap.shape} = `f' and 0 otherwise\}, and so on down to \{$z_{ i5 } =$ 1 if \texttt{cap.shape} = `x' and 0 otherwise\}. Now examine the extracts from the \texttt{tab.sum} and regression fitting in Table \ref{t:data-analysis-1}. Explicitly identify $( \hat{ \beta }_0, \dots, \hat{ \beta }_5 )$ in the output in the table \textit{[5 points]}, and -- by thinking about the form of equation (\ref{e:linear-regression-4}) for each of the levels of \texttt{cap.shape} -- explicitly relate the numbers in the \texttt{mean} column of the \texttt{tab.sum} output in Table \ref{t:data-analysis-1} to the $\hat{ \beta }_j$ \textit{[5 points]}.

\begin{table}[t!]

\centering

\caption{\textit{Extracts from the output of the second code block.}}

\begin{verbatim}
          #      cap.shape n    mean      sd       

          # [1,] 1         452  0.1061947 0.308428 
          # [2,] 2         4    1         0        
          # [3,] 3         3152 0.4936548 0.5000391      output of tab.sum
          # [4,] 4         828  0.7246377 0.4469667
          # [5,] 5         32   0         0        
          # [6,] 6         3656 0.4671772 0.4989898

          # Coefficients:
          #             Estimate Std. Error t value Pr(>|t|)    
          # (Intercept)  0.10619    0.02279   4.659 3.22e-06 ***
          # cap.shapec   0.89381    0.24335   3.673 0.000241 ***   output of
          # cap.shapef   0.38746    0.02437  15.898  < 2e-16 ***   linear
          # cap.shapek   0.61844    0.02834  21.824  < 2e-16 ***   regression
          # cap.shapes  -0.10619    0.08864  -1.198 0.230926    
          # cap.shapex   0.36098    0.02416  14.942  < 2e-16 ***
\end{verbatim}

\label{t:data-analysis-1}

\end{table}

\item[(c)]

\textit{[20 total points for this sub-problem]} Toward the end of the second code block, the code computes predicted $\hat{ p } = P ( y = 1 \given x_1 \, \mathbb{ L } \, \mathcal{ B } )$ values (in which $\mathbb{ L }$ signifies the linear regression modeling assumption, which is not part of $\mathcal{ B }$), makes a diagnostic plot and computes a numerical diagnostic -- the \textit{Predictive Separation Index (PSI)} -- measuring the predictive strength of the factor \texttt{cap.shape}.

\begin{itemize}

\item[(i)]

\textit{[10 total points for this sub-problem]} The diagnostic plot is in two parts: the top panel is a histogram of the $\hat{ p }$ values for the mushrooms for which $y = 0$, and the bottom panel shows the same thing except for the cases in which $y = 1$. What would the ideal shape of these histograms be, if a factor $x$ offers perfect information for predicting a dichotomous outcome $y$? Explain briefly \textit{[5 points]}. Do the histograms achieve that goal with the predictor \texttt{cap.shape}? Explain briefly \textit{[5 points]}.

\item[(ii)]

\textit{[10 total points for this sub-problem]} The PSI, which is a numerical index that goes hand-in-hand with the diagnostic plot, is defined as follows:
\begin{equation} \label{e:psi-1}
PSI ( x ) = [ \textrm{mean } ( \hat{ p } \textrm{ given } x ) \textrm{ when } y = 1 ] - [ \textrm{mean } ( \hat{ p } \textrm{ given } x ) \textrm{ when } y = 0 ] \, .
\end{equation}
What's the ideal value of the $PSI$, if a factor $x$ is perfectly predictive of $y$? Explain briefly \textit{[5 points]}. Does the $PSI$ come close to achieving that goal with \texttt{cap.shape}? Explain briefly \textit{[5 points]}.

\end{itemize}

\begin{table}[t!]

\centering

\caption{\textit{Predictive accuracy of each of the factors $x$ in the mushroom data set, with $PSI$ sorted from largest to smallest.}}

\bigskip

\begin{tabular}{c|cc}

Factor $( x )$ & $PSI$ & Predictive Power \\

\hline

$\vdots$ & $\vdots$ & $\vdots$ \\

\texttt{cap.shape} & 0.0603 & weak \\

\texttt{cap.surface} & 0.0388 & weak \\

$\vdots$ & $\vdots$ & $\vdots$ \\

\end{tabular}

\label{t:summary-table-1}

\end{table}

\item[(d)]

\textit{[20 total points for this sub-problem]} Run the third code block and study the output. I've written a function called \texttt{univariate.exploration} that automates the process of repeating the first and second code blocks; run this function with each of the other 20 categorical predictors (save \texttt{odor} for last, for reasons that will become clear); in each case, pay particular attention to the table created by \texttt{tab.sum}, the diagnostic plot and the $PSI$ value. Summarize Your findings by completing Table \ref{t:summary-table-1} (sort your entries from highest $PSI$ down to lowest); I suggest that You use the phrases \textit{extremely strong}, \textit{strong}, \textit{moderate}, \textit{weak}, and \textit{almost none} to describe the predictive power of each $x$ variable (you can choose your own cutpoints defining those categories) \textit{[15 points]}. If You were going to base the app on only one or two predictors, which ones look like the best candidates? Explain briefly \textit{[5 points]}.

\item[(e)]

\textit{[15 total points for this sub-problem]} In the output from code block 3, the PSI for \texttt{cap.surface} came out 0.03877928, which we could round to 0.03878. That number appears somewhere else in the regression output; where? Read pages 748--749 in DeGroot and Schervish (2012), available on the course web page; based on Your reading of these pages, briefly explain what the number in the regression output is trying to measure \textit{[5 points]}. Does it make sense that the PSI and this number are closely related? (This relation only holds for regressions with a dichotomous outcome; if $y$ is continuous, the PSI doesn't make sense.) \textit{[5 points]} Check several other sets of output from the \texttt{univariate.exploration} function with different predictors; is the relation between the PSI and the number in the regression output always the same? \textit{[5 points]}

\end{itemize}

Run the fourth code block, in which a linear regression model is fit with all available predictors (let's call this the \textit{full model} $\mathbb{ F }$), and study the output. You can see that \texttt{R} has a convenient way (\texttt{poisonous} $\sim$ \texttt{.}) to specify all of the predictors without having to name all of them. You can further see that prediction of the poisonous status of all $n =$ 8,124 mushrooms using the full model is perfect: all of the truly poisonous mushrooms have estimated $P ( y_i = 1 \given \mathbb{ L } \, \bm{ x }_i \, \mathbb{ F } \, \mathcal{ B } ) = 1$, and all of the truly edible mushrooms have estimated $P ( y_i = 1 \given \bm{ x }_i \, \mathbb{ L } \, \mathbb{ F } \, \mathcal{ B } ) = 0$ (here $\bm{ x }_i$ is the vector of predictor variables for mushroom $i$). 

However, this evaluation of the predictive quality of $\mathbb{ F }$ may overstate its accuracy, because we used the same (entire) data set both to fit $\mathbb{ F }$ and then to see how good $\mathbb{ F }$ is. As mentioned in class, \textit{cross-validation (CV)} is a good way to check on the extent of any over-fitting: You partition the data set at random into non-overlapping subsets, fit the model on one subset, and evaluate the quality of the fit on another. A well-established CV method is called \textit{$s$--fold cross-validation}: randomly partition the entire data set into $s$ non-overlapping exhaustive subsets $\{ S_1, \dots, S_s \}$, and loop as $j$ (say) goes from 1 to $s$: set aside subset $S_j$, fit the model $M_j$ on the union of all of the other subsets, and evaluate the quality of the fit on $S_j$, by using $M_j$ to predict all of the $y$ values in $S_j$; when the loop is finished, average the resulting $s$ quality estimates to get an overall evaluation of the model's predictive accuracy that avoids over-fitting.

\begin{itemize}

\item[(f)]

\textit{[10 total points for this sub-problem]} Run the fifth code block, which implements $s$--fold CV with $s = 10$ (this choice has been shown to be reliable), and study the output, which is summarized in a graph and a number: the graph plots the cross-validated predictions against the true values of the outcome variable \texttt{poisonous}, and the number is the CV estimate of what's called the \textit{root-mean-squared error (RMSE)} $\hat{ \sigma }$ of the regression predictions, namely $\hat{ \sigma } = \sqrt{ \frac{ 1 }{ n } \sum_{ i = 1 }^n ( y_i - \hat{ y }_i )^2 }$ (here $\hat{ y }_i$ is the predicted value of $y_i$; what the code actually does is (a) compute the $s = 10$ separate estimates $\hat{ \sigma }_j$ of the $RMSE$ arising from the cross-validation process and then (b) combine the $\hat{ \sigma }_j$ values optimally with $\hat{ \sigma } = \sqrt{ \frac{ 1 }{ n } \sum_{ j = 1 }^s \hat{ \sigma }_j^2 }$). Does the graph arising from the CV process support the idea that the predictions from the full model $\mathbb{ F }$ are perfect, even when cross-validated? Explain briefly \textit{[5 points]}. Does the cross-validated $RMSE$ value also support this idea? Explain briefly \textit{[5 points]}. 

\end{itemize}

Now that we've achieved the rare feat of perfect prediction, let's think about the app we're designing: do we really want to make users supply multiple-choice answers to 21 different questions about the mushroom they're thinking of eating?
The next (and nearly final) task in this problem is to see if a subset of the full set of 21 predictors can do as well, or nearly as well, in predictive accuracy as the full model $\mathcal{ F }$.

\begin{itemize}

\item[(g)]

\textit{[10 total points for this sub-problem]} Run the sixth code block, which implements a method called \textit{step-wise variable selection}, using the \textit{Bayesian Information Criterion (BIC)} we talked about in class; recall that lower $BIC$ values correspond to better models. The output of this code block is sufficiently voluminous that I put it into another \texttt{.txt} file, also available on the course web page:

\hspace*{0.75in} \texttt{stat-206-mushroom-analysis-variable-selection-with-bic.txt}

Study the output from this code block. This implementation of the \texttt{R} function \texttt{step} starts with the \textit{null model} consisting of just an intercept term, and then sequentially chooses the best variable not yet in the model and adds it. For the mushroom data, the algorithm goes through 11 iterations of this method, until it discovers that the model with 10 sequentially-best predictors yields perfect predictions, at which point it stops with an excellent and snarky warning message. By thinking about what the output is saying about the best subset of $x$ variables, and in what order, answer the following three questions, as $k$ goes from 1 to 3: 

\begin{quote}

If the app were going to be based on only $k$ variable(s), which \{one is\}/\{ones are\} best?

\end{quote}

Explain briefly \textit{[10 points]}. 

\begin{table}[t!]

\centering

\caption{\textit{Cross-tabulation of truth against what the app says for decision rules based on \texttt{odor} with two $\hat{ p }$ cutoffs, 0.05 (left) and 0.01 (right).}}

\bigskip

\begin{tabular}{c|c}

\textsf{0.05 Cutoff} & \textsf{0.01 Cutoff} \\ \cline{1-2}

\begin{tabular}{cc|c|c|c}

& \multicolumn{1}{c}{} & \multicolumn{2}{c}{\textbf{Truth}} \\

& \multicolumn{1}{c}{} & \multicolumn{1}{c}{Poisonous} & \multicolumn{1}{c}{Edible} & Total \\ \cline{3-4}

\textbf{App} & Poisonous & \multicolumn{1}{c|}{} & \multicolumn{1}{c|}{} \\ \cline{3-4}

\textbf{Says} & Edible & \multicolumn{1}{c|}{} & \multicolumn{1}{c|}{} \\ \cline{3-4}

& \multicolumn{1}{c}{Total}

\end{tabular}

& 

\begin{tabular}{cc|c|c|c}

& \multicolumn{1}{c}{} & \multicolumn{2}{c}{\textbf{Truth}} \\

& \multicolumn{1}{c}{} & \multicolumn{1}{c}{Poisonous} & \multicolumn{1}{c}{Edible} & Total \\ \cline{3-4}

\textbf{App} & Poisonous & \multicolumn{1}{c|}{} & \multicolumn{1}{c|}{} \\ \cline{3-4}

\textbf{Says} & Edible & \multicolumn{1}{c|}{} & \multicolumn{1}{c|}{} \\ \cline{3-4}

& \multicolumn{1}{c}{Total}

\end{tabular}

\end{tabular}

\label{t:decision-rules-1}

\end{table}

\item[(h)]

\textit{[25 total points for this sub-problem]} Finally, suppose that we tentatively decide to base the app only on the mushroom's \texttt{odor}. We would then still have to specify a decision rule to implement in the app, as a function of the $\hat{ p }$ value it produces for a new mushroom. It's easy to show (You're not asked to show this) that the optimal decision rule is of the form

\begin{quote}

If $\hat{ p } \ge c$, declare the mushroom poisonous; otherwise declare it edible

\end{quote}

for some $0 \le c \le 1$. Run the seventh code block, which summarizes the quality of two \texttt{odor}--based decision rules, one with $c = 0.05$ and the other with $c = 0.01$. Fill out Table \ref{t:decision-rules-1} above by
\textbf{carefully} re-arranging the output of the final two \texttt{table} commands in the code block \textit{[10 points]}. Letting (as usual with classification rules) \{App says poisonous\} be a positive (+) finding and \{App says edible\} be a negative (--) result, use Your filled-out Table \ref{t:decision-rules-1} to estimate the false-positive and false-negative rates for each of the 0.05-- and 0.01--cutoff rules  (You may wish to refer back to the \textit{ELISA} case study in class and/or Take-Home Test 1) \textit{[10 points]}. Considering the real-world implications of a false-positive error, and repeating for false-negative mistakes, is the 0.05--cutoff rule acceptable as a basis for our app? What about the 0.01--cutoff rule? Explain briefly in both cases \textit{[5 points]}.

\item[(i)]

\textit{Extra credit [up to 20 extra points]:} Repeat part (h) (modifying code block 7 appropriately) with the app based on the best \textit{two} predictor variables (instead of just \texttt{odor}), and exploring new cutoff values $c$. Is there now, with the two best predictors instead of one, an optimal cutoff that You regard as an acceptable trade-off between false-positive and false-negative mistakes, if You were going to sell the resulting app to wild-mushroom hunters? Explain briefly \textit{[up to 10 extra-credit points]}; \textit{[up to 10 more extra-credit points]} for repeating (h) with the best \textit{three} predictors.

\end{itemize}

\item[(B)]

\textit{[220 total points for this problem]} Email has been around since 29 Oct 1969, when a researcher named Ray Tomlinson sent the first email message over a network called \texttt{ARPANET}. It appears that the first email \textit{spam} message --- \texttt{Google dictionary} defines spam as ``irrelevant or inappropriate messages sent on the Internet to a large number of recipients'' --- was sent on 1 May 1978 (it was an advertisement for a presentation by the now-defunct \texttt{Digital Equipment Corporation} (DEC), sent by a \texttt{DEC} marketer to several hundred \texttt{ARPANET} users). It was reliably estimated\footnote{\texttt{https://www.statista.com/statistics/420391/spam-email-traffic-share}, accessed on 13 Mar 2021.} in Feb 2021 that about \textbf{71\%} of all email messages worldwide as of Apr 2014 were spam, and --- while that figure declined to about 47\% in Sep 2020 --- it's still a major problem today. What can You, as an email user, do to minimize the amount of spam You receive in Your inbox?

One idea would be to construct a type of algorithm called a \textit{spam filter}: a program that inputs an email message and outputs a tentative classification of that message as spam or not-spam (which is sometimes whimsically referred to as \textit{ham}). This problem is about an attempt by researchers at Hewlett-Packard Labs (HPL) in 1999 to construct a spam filter that was \textit{personalized} to a single individual named George Forman, the HPL postmaster at the time. George collected a representative sample of 1,813 messages that were certified as spam by him and by other HPL researchers, and he also obtained a representative sample of 2,788 email messages from HPL employees that were certified to be non-spam, yielding a data set with $n =$ 4,601 messages. 

\begin{quote}

\textbf{The file \texttt{spam-data-context.txt} on the course web page contains additional contextual information; please make a directory that will hold all of the files in this case study, and download and study the context file carefully before continuing.}

\end{quote}

Our data sets this quarter have all had the same structure: a matrix with one row for each \textit{subject} of study (e.g., people, strands of copper wire,~...) and one column for each \textit{variable} measured on the subjects. The HPL spam case study poses an interesting data-science problem: how do You summarize Your information when the 4,601 subjects You're studying are \textit{documents}?

\begin{itemize}

\item[(a)]

\textit{[20 total points for this sub-problem]} By answering questions (a)(i) and (a)(ii), briefly (but fully) describe the methods used by the HPL researchers to extract what they judged to be relevant information from the documents for the purpose of spam detection, and suggest new ideas for additional relevant information extraction (no points awarded for this paragraph; points given for your answers to (i) and (ii) below):

\begin{itemize}

\item[(i)]

Succinctly summarize section 7 in the context \texttt{.txt} file, listing all of the types of variables the HPL people defined (e.g., \textbf{48 continuous real [0,100] attributes of type~...}) and commenting on whether (in your view) their choices were reasonable. \textit{[10 points]}

\item[(ii)]

If You wanted to go beyond what the HPL researchers did in extracting spam-relevant information from their email messages, in what other ways would You data-mine the texts in these messages to obtain additional spam-predictive signal? Please suggest at least two additional methods, explaining briefly. \textit{[10 points]} 

\end{itemize}

\item[(b)]

\textit{[10 total points for this sub-problem]} The following statement is made in the context file: \textit{False positives (marking good mail as spam) are very undesirable.} Do You agree with this statement? Explain briefly. Do You agree with the following natural companion statement? \textit{In this problem, false positives are worse than false negatives.} Explain briefly.

\end{itemize}

\begin{quote}

\textbf{The HPL data set is on the course web page in the file \texttt{spam.txt} . I've written \texttt{R} code that helps You analyzes these data, in two files that are also on the course page --- \texttt{spam-partial-data-analysis-R.} \texttt{txt} and \texttt{spam-analysis-functions-R.txt} --- and there's a third file --- \texttt{spam-data-analysis-variable-summary.txt} --- that contains some of the output You'll produce. Please download all of these files into Your directory for this case study before going on with the rest of the problem.}

\end{quote}

Our first task is to understand the scales on which the predictor and outcome variables currently live, by making graphical and numerical summaries of them, both one by one and by examining the bivariate relationships between each predictor and the outcome. We'll be using the 2--way cross-validation \textit{2--CV} method in this analysis to guard against overfitting:

\begin{quote}

\textbf{Note that, toward the beginning of the first code block, the code uses the already-defined variable \texttt{testid} to randomly partition the dataset $D$ containing the $n = 4601$ email messages into a \textit{Modeling} subset $D_M$ (consisting of almost exactly $\frac{ 2 }{ 3 }$ \vspace*{0.025in} ($n_M = 3065$) rows) and a \textit{Validation} subset $D_V$ (containing almost exactly $\frac{ 1 }{ 3 }$ ($n_V = 1536$) rows). The strategy will be (a) to fit various models using only the data in $D_M$ and (b) to see how well they predict the outcome variable $y$ in $D_V$.}

\end{quote}

\begin{itemize}

\item[(c)]

\textit{[15 total points for this sub-problem]} Run the first code block in the file \texttt{spam-partial-data- analysis-R.txt} and examine the output. With reference to the results from the \texttt{summary( raw.data.modeling )} command, what's the minimum value taken on by each of the 57 predictors? What's the relationship between the median and the mean for each predictor? What do these two aspects of the data set imply about the histogram shapes of the predictors? Explain briefly.

\item[(d)]

\textit{[10 total points for this sub-problem]} To obtain the histograms mentioned in part (c), run the second code block in three steps: run plotting code block 1 and examine the results; repeat with plotting code block 2; and repeat with plotting code block 3, this time making a PDF file of the resulting plot and including it in Your solutions. Did the histograms come out as You expected them to? Explain briefly.

\end{itemize}

The usual descriptive tools for examining the way quantitative predictors $x$ and quantitative outcomes $y$ relate to each other are scatterplots (graphical) and correlations (numerical). These methods were created for Bivariate Normal situations, which we emphatically do not have here, but they're useful anyway as long as we don't interpret their results inappropriately. For example, correlations are \textit{attenuated} (diminished in absolute value) when $x$ or $y$ is binary; what would ordinarily be a mediocre correlation of (say) $+0.36$ with Bivariate Normal data is actually quite strong when $y$ is dichotomous.

\begin{itemize}

\item[(e)]

\textit{[10 total points for this sub-problem]} Run the third code block and examine the full output, which is truncated in the \texttt{.txt} file containing the code. Bearing in mind that $y$ is coded 1 for spam and 0 for non-spam, and that the emails were chosen so that the result is a personal spam filter for George Forman, do the magnitudes and $+/-$ signs of any of the following correlations of $y$ with predictors $x$ surprise You? $x =$ \{ `ch..4', `hp', `credit', `george', `ch..5', `crl.long' \} Explain briefly.

\end{itemize}

I bring up the issue of distributional shape of the predictors because research has shown that predictive signal is maximized (if the outcome is dichotomous) when the histogram of a quantitative predictor $x$ is close to Normal. If this is not true of $x$ on the original scale on which the data values were collected, You can often increase predictive power by \textit{transforming} $x$ to a new scale of measurement, using one of two approaches:

\begin{itemize}

\item[(I)]

You can look for an invertible function $g$ such that the distribution of $g ( x )$ is closer to Normality, or

\item[(II)]

You can create a set of indicator variables from $x$ by partitioning it at its quantiles, and model $x$ as a \textit{factor} instead of as a continuous predictor. If the sample size $n$ is large, a reasonable choice is to cut $x$ into 10 groups at its \textit{deciles} (the 0th, 10th, 20th,~...~and 100th percentiles); when $n$ is smaller, you may have to settle for only (say) 4 or 5 groups (again cut at the corresponding quantiles). The goal is for all of the indicator variables to have about the same number of subjects (here, email messages) for which the indicator takes the value 1; in other words, the cuts for predictor $x$ are \textit{not} equally spaced on the scale of $x$.

\end{itemize}

All of the models we'll fit in this problem are instances of \textit{logistic regression}, which has the general form (for $i = 1, \dots n$)
\begin{eqnarray} \label{e:logit-model}
( Y_i \given p_i \, \mathcal{ B } ) & \stackrel{ \text{\footnotesize I} }{ \sim } & \text{Bernoulli} ( p_i ) \nonumber \\
\text{logit} ( p_i ) \triangleq \log \left( \frac{ p_i }{ 1 - p_i } \right) & = & \sum_{ j = 1 }^k \beta_j \, x_{ ij } = ( X \, \bm{ \beta } )_j \, ;
\end{eqnarray}
here $\stackrel{ \text{\footnotesize I} }{ \sim }$ means \textit{are independently distributed as}, $X$ is an $( n \times k )$ matrix whose first column is all 1s to accommodate the intercept term and all of whose other columns are vectors $x_j$ containing the values of the predictor variables, and $\bm{ \beta }$ is a $( k \times 1 )$ vector of (unknown) regression coefficients $\beta_j$. This model is a special case of the class of methods known as \textit{generalized linear models} (GLMs); here the \textit{link function} that connects the $p_i$ to $X$ is logit$( p_i )$. \textbf{NB} With the notation in equation (\ref{e:logit-model}) above, the intercept is called $\beta_1$, not $\beta_0$ as in problem 2(A); the length of the resulting $\bm{ \beta }$ vector is then $k$ (the number of unknown parameters in the model), not $( k + 1 )$ as in the earlier Calculation problem in this test.

\texttt{R} has a built-in function called \texttt{glm} that performs maximum-likelihood fitting of model (\ref{e:logit-model}) and other GLMs; with $\bm{ y } = ( y_1, \dots, y_n )$ as the observed vector of binary outcomes and \texttt{x.1} and \texttt{x.2} as vectors containing the values of two of the predictor variables, the command \texttt{glm( y $\sim$ x.1, family = binomial( link = `logit' ) )} fits the logistic regression model in which the only predictor is \texttt{x.1}, and \texttt{glm( y $\sim$ x.1 + x.2, family = binomial( link = `logit' ) )} fits the corresponding model in which \texttt{x.1} and \texttt{x.2} are both included as predictors.

In logistic regression there's no closed-form expression for $\hat{ \bm{ \beta } }_{ MLE }$; iterative numerical methods have to be used to fit the model. \texttt{R} uses a natural and efficient method called \textit{Fisher scoring}, which is just an application of the Newton-Raphson method to the logistic regression log likelihood function. Large-sample estimated standard errors for the components of $\hat{ \bm{ \beta } }_{ MLE }$ are then based as usual on the observed information matrix, and \texttt{R} calculates signal-to-noise ratios
\begin{equation} \label{signal-to-noise-1}
 \hat{ z }_j = \left( \hat{ \bm{ \beta } }_{ MLE } \right)_j / \widehat{ SE } \left[ \left( \hat{ \bm{ \beta } }_{ MLE } \right)_j \right]
\end{equation}
to help in identifying which predictors (after adjusting for all the other predictors) make the strongest predictive contributions.

As with all statistical models, we need a method for judging when one logistic regression model is better than another. The \texttt{glm} function has the frequentist method \textit{AIC} built into it that goes hand-in-hand with maximum-likelihood fitting of a GLM; as we discussed in class, \textit{AIC} is similar to \textit{BIC} in that they both trade off goodness-of-fit of a model against its complexity, but \textit{AIC} and \textit{BIC} have different complexity penalties. For any parametric model $\mathbb{ M }$ containing a $( k \times 1 )$ vector $\bm{ \theta }$ of unknown quantities (parameters) and no random effects, it turns out that the complexity of $\mathbb{ M }$ (on the log likelihood scale) is meaningfully defined simply to be $k$, and goodness of fit can be measured by calculating the maximum value attained by the log-likelihood function $\ell \ell ( \bm{ \theta } \given \bm{ y } \, \mathbb{ M } \, \mathcal{ B } )$ in model $\mathbb{ M }$. With this in mind, \textit{AIC} is computed as follows:
\begin{equation} \label{e:aic}
AIC ( \mathbb{ M } \given \bm{ y } \, \mathcal{ B } ) = - 2 \, \ell \ell ( \hat{ \bm{ \theta } }_{ MLE } \given \bm{ y } \, \mathbb{ M } \, \mathcal{ B } ) + 2 \, k \, . 
\end{equation}
Because we want the maximum log-likelihood value of a model to be big as an indication of its goodness of fit, the multiplication of $\ell \ell ( \hat{ \bm{ \theta } }_{ MLE } \given \bm{ y } \, \mathbb{ M } \, \mathcal{ B } )$ by $- 2$ means that models with \textit{small} values of \textit{AIC} are preferred (where by the phrase ``\textit{AIC}$_1$ is \textit{smaller than} \textit{AIC}$_2$'' I mean that \textit{AIC}$_1$ is \textit{to the left of} \textit{AIC}$_2$ \textit{on the number line}, not that \textit{AIC}$_1$ is \textit{closer to 0} than \textit{AIC}$_2$).

\begin{itemize}

\item[(f)]

\textit{[10 total points for this sub-problem]} Run the fourth code block, stopping and starting as recommended; study the logic behind the commands and examine the output. This chunk of code implements an instance of method (I) mentioned above, with a transformation of the form $x \rightarrow \log ( x + C )$ on the predictor variable \texttt{make} plus an indicator variable for \texttt{( make = 0 )}. Two logistic regressions are run in this code block: \texttt{spam} predicted by \texttt{make}, and \texttt{spam} predicted by \texttt{log( make + C )} and \texttt{make.is.0}. Identify the \textit{AIC} values for these two models. Has this transformation succeeded in extracting additional predictive signal from \texttt{make}? Explain briefly.

\end{itemize}

The fifth code block implements an instance of method (II) mentioned above (creating categorical indicators from a continuous predictor $x$ and using them instead of $x$ itself), and also creates a diagnostic plot that helps to see how a predictor might optimally enter the modeling. If a predictor $x$ has a non-linear relationship with logit$( p )$, the simplest form of non-linearity is quadratic, so people often create a new variable \texttt{x.squared <- x * x} and logistically regress $y$ on both \texttt{x} and \texttt{x.squared} to explore this possibility. The function I've written for You does six things with a predictor \texttt{x}: 

\begin{itemize}

\item[(\textit{1})]

it creates the categorical variable \texttt{x.cut} from \texttt{x} by partitioning \texttt{x} at a user-specified number of quantiles, and runs the function \texttt{tab.sum} (from problem 2(A)) to examine the resulting predictive signal; 

\item[(\textit{2})]

it logistically regresses \texttt{y} on \texttt{x} and harvests the \textit{AIC} value; 

\item[(\textit{3})]

it repeats (\textit{2}) with the predictors \texttt{x} and \texttt{x.squared} and grabs the resulting \textit{AIC}; 

\item[(\textit{4})]

it repeats (\textit{2}) with the predictor \texttt{x.cut} and takes note of the \textit{AIC} value that results; 

\item[(\textit{5})]

it prints a summary table of the three \textit{AIC} values to help You see which modeling approach is best; and 

\item[(\textit{6})] 

it creates an interesting plot. 

\end{itemize}

The red dots in the plot are the observed \texttt{x} and \texttt{y} values, so that You can see where most of the data is concentrated as You scan from left to right; it computes the predicted \texttt{p.hat} values based on \texttt{x}, transforms them to the logit scale, and plots a \texttt{lowess} smooth of the transformed \texttt{p.hat} values against \texttt{x}, together with a user-specified number of bootstrap replicate \texttt{lowess} curves, to give You an idea both of the the relationship between \texttt{x} and logit( \texttt{p.hat} ) and the uncertainty in that relationship. (I've truncated the plot at $\hat{ p } = 0.999$ and 0.001, to keep it from going so wide vertically that the information in the plot for $0.001 < \hat{ p } < 0.999$ is harder to absorb.)

\begin{itemize}

\item[(g)]

\textit{[45 total points for this sub-problem]} Run the fifth code block, stopping after each call to the function \texttt{logistic.regression.univariate.exploration} to examine the output before going on to the next call.

\begin{itemize}

\item[(i)]

Examine the \texttt{tab.sum} results for the predictor \texttt{make} with 10 requested categories. Is there predictive signal in \texttt{x.cut} with this choice of $x$? Explain briefly \textit{[5 points]}.

\item[(ii)]

Looking at the \textit{AIC} summary table for \texttt{make} with \texttt{n.cutpoints} = 10, which of the three approaches to bringing this $x$ into the modeling is most successful? Explain briefly \textit{[5 points]}.

\item[(iii)]

Does the diagnostic plot for the predictor \texttt{make} exhibit curvature in the relationship between \texttt{make} and logit$( p )$? Is this confirmed in the logistic regression results when $y$ is modeled as a function of \texttt{make} and \texttt{make.squared}? (\textit{Hint:} Look at the table of estimated coefficients for the quadratic regression model.) Explain briefly \textit{[10 points]}.

\item[(iv)]

Is there noticeably more predictive signal in \texttt{make} when \texttt{n.cutpoints} = 20 is specified than with \texttt{n.cutpoints} = 10? If we try to go on with \texttt{n.cutpoints} = 30 or 40 or~..., what do You expect will eventually happen?  Explain briefly \textit{[10 points]}.

\item[(v)]

In part (e) of this question You probably noticed that the word \texttt{your} 
had the strongest univariate predictive signal. Is this fact reflected in the diagnostic plot for this predictor? Explain briefly \textit{[5 points]}.

\item[(vi)]

Is there more predictive signal in \texttt{x.cut} based on \texttt{make} with \texttt{n.cutpoints} = 10 (method (II)) than the corresponding signal in \{\texttt{log( make + C )} + \texttt{make.is.0}\} (method (I))? Explain briefly \textit{[5 points]}.

\end{itemize}

\item[(h)]

\textit{[10 total points for this sub-problem]} Run the sixth code block; study the logic behind the commands and examine the full output of the \texttt{glm( )} function call, which is truncated in the \texttt{R} code file but available in full in the file \texttt{spam-data-analysis-glm-results.txt} on the course web page. 

\begin{itemize}

\item[(i)]

As explained in the \texttt{R} code file, run the function 

\begin{quote}

\texttt{logistic.regression.effect.of.adding.a.new.variable} 

\end{quote}

three times, with \texttt{beta.x} = $0$, then $-3$, then $+3$. Would you say that the predicted probabilities of $( y = 1 )$ (the \texttt{p.hat} values in the function call) change dramatically as \texttt{beta.x} goes from $-3$ to $+3$? Explain briefly \textit{[5 points]}. 

\item[(ii)]

Find the five predictor variables with the largest absolute $\hat{ z }_j$ values in the \texttt{glm( )} output, and compare that list (in order) with the five predictor variables in code block 3 having the highest correlations with the outcome variable \texttt{spam}. Should it worry us that these lists are not the same? Explain briefly \textit{[5 points]}.  

\end{itemize}

\item[(i)]

\textit{[10 total points for this sub-problem]} Run the seventh code block, stopping and starting as recommended; study the logic behind the commands and examine the output. 

\begin{itemize}

\item[(i)]

What value of the predictive separation index (PSI) from problem 2(A) does the logistic regression model with all 57 predictors achieve, when the model fit to the \textsf{Modeling} subset of data is evaluated on the same subset? Where would you regard these predictions on the scale \{terrible, not great, okay, good, terrific\}? Explain briefly \textit{[5 points]}. 

\item[(ii)]

How much did the PSI change when the model (fit to the \textsf{Modeling} subset) was required to predict the email messages in the \textsf{Validation} data set? Does this mean that the logistic regression model is guilty of a \{large, moderate, small\} amount of over-fitting? Explain briefly \textit{[5 points]}.

\end{itemize}

\item[(j)]

\textit{[35 total points for this sub-problem]} Run the eighth code block, stopping and starting as recommended; study the logic behind the commands and examine the output. The point of this code block is to fit a Bayesian logistic regression model, to the \textsf{Modeling} subset, that shrinks unstable maximum-likelihood regression coefficients toward 0 to stabilize the prediction process; a carefully constructed prior called the \textit{horseshoe} prior is used to accomplish this shrinkage.

\begin{itemize}

\item[(i)]

The \textit{ESS (effective sample size)} value for a logistic regression coefficient is an estimate of the equivalent \textit{Monte Carlo (MC)} sample size for that parameter if \textit{IID} sampling had been available instead of MCMC. The values marked \texttt{ESS} in the output of \texttt{bayesreg} are mis-labeled; they're really ratios of the form
\begin{equation} \label{ess-1}
\textrm{ESS \%} \triangleq 100 \left( \frac{ \begin{array}{c} \textrm{number of IID MC monitoring values} \\ \textrm{to achieve a given accuracy goal} \end{array} }{ \begin{array}{c} \textrm{number of MCMC monitoring values} \\ \textrm{to achieve the same accuracy goal} \end{array} } \right) \% \, .
\end{equation}
If the Markov chain is mixing well, all of these values should be at or near 100\%. Would you say that the chain created by \texttt{bayesreg} is mixing well, poorly, or in between? Explain briefly \textit{[10 points]}.

\item[(ii)]

The predictors with the largest standardized regression coefficients in absolute value in the maximum likelihood logistic regression are \texttt{george} and \texttt{cs}. Has the Bayesian model succeeded in stabilizing the maximum likelihood estimation process, at least to some extent, for those two predictors? What percentage of the $k = 57$ coefficients have been shrunken toward 0 in the Bayesian fitting process? Explain briefly \textit{[10 points]}.

\item[(iii)]

Are the Bayesian predictions dramatically more accurate than the maximum likelihood results? Is the Bayesian fitting process guilty of more over-fitting than maximum likelihood, or less, or about the same?  Explain briefly \textit{[10 points]}.

\item[(iv)]

Examine the tables, at the end of code block 8, that include uncertainty bands for the \texttt{p.hat} values from the Bayesian fitting. Would you say that, even with 3,065 email messages in the \textsf{Modeling} subset, there is substantial uncertainty in the prediction process? Explain briefly \textit{[5 points]}.

\end{itemize}

\end{itemize}

\begin{quote}

\textbf{The last few parts of this problem are intended to whet your appetite for continued data science study: you'll be using tools that were only developed in the past 10 years or so.}

\end{quote}

\begin{itemize}

\item[(k)]

\textit{[20 total points for this sub-problem]} Run the ninth code block; study the logic behind the commands and examine the output. This chunk of code fits (via maximum likelihood) a logistic regression model in which we include not only all $k = 57$ predictors but also all \textit{two-way interactions} between those variables. To create an interaction predictor between (say) $x_1$ and $x_2$, we just form the product $( x_1 \cdot x_2 )$, and interacting a variable $x_1$ with itself adds the quadratic term $( x_1 )^2$ to the modeling; this brings in \textit{non-linear} predictive information, which can greatly improve the predictive process. Let's call the model with all \textit{main effects} (the information in the $k = 57$ predictors) and all two-way interactions $\mathbb{ M }_{ big }$.

\begin{itemize}

\item[(i)]

Compare the typical magnitudes of the logistic regression coefficients in $\mathbb{ M }_{ big }$ with the corresponding values in the earlier regression with only the 57 main effects. Do you agree with the statement \textit{It looks like maximum-likelihood logistic regression has gone crazy}? Explain briefly \textit{[5 points]}.

\item[(ii)]

How do the predictions from $\mathbb{ M }_{ big }$ appear to compare in quality (e.g., PSI) with those from the earlier model that included only the main effects, when $\mathbb{ M }_{ big }$ is used to predict the $y$ values for the rows of the data matrix that are in the modeling subset? Explain briefly \textit{[5 points]}. 

\item[(iii)]

How much of a fall-off is there from the PSI in (ii) to the PSI when $\mathbb{ M }_{ big }$ (fit to the modeling subset) is forced to predict the \texttt{spam} outcomes on the validation subset? Does this indicate a small, medium, or large amount of over-fitting by maximum likelihood logistic regression with 1,654 predictors and 3,065 (modeling subset) observations? Explain briefly \textit{[10 points]}.

\end{itemize}

\item[(l)]

\textit{[20 total points for this sub-problem]} Run the tenth code block; study the logic behind the commands and examine the output. This chunk of code uses a modification of maximum likelihood fitting of \textit{Generalized Linear Models (GLMs)}, of which logistic regression is a special case; in this modification, which is called the \textit{lasso}, the deviance of a model is \textit{penalized} in such a way that maximum likelihood's tendency to over-fit GLMs when $k$ (the number of predictors) is a substantial fraction of $n$ (the number of subjects in the data set) is greatly diminished.

\begin{itemize}

\item[(i)]

This code block contains a summary of the estimated coefficients in the main-effects-only model fit to the modeling subset using three different fitting methods: maximum likelihood, Bayesian regression with the horseshoe prior, and the lasso. Succinctly summarize any differences you see among the coefficient estimates from the three methods. \textit{[10 points]}.

\item[(ii)]

This chunk of code also contains a summary of the PSI attained by the three methods mentioned in part (i) immediately above. Are there striking differences among the methods in this measure of predictive accuracy? 
Explain briefly \textit{[5 points]}.

\item[(iii)]

The output at the end of this code block summarizes the degree of fall-off of predictive accuracy achieved by ordinary maximum likelihood and the lasso in moving from the (modeling subset fit, modeling subset predicted) to the (modeling subset fit, validation subset predicted) scenarios. We've already seen that unpenalized maximum likelihood badly fails this test with $( k, n ) = ( 1654, 3065 )$; is the lasso completely successful in this case study when viewed in this way? Explain briefly \textit{[5 points]}.

\end{itemize}

\item[(m)]

\textit{[5 total points for this sub-problem]} And finally, run code block 11; study the logic behind the commands and examine the output. This code chunk implements a highly successful machine learning predictive technique with statistical roots called \textit{gradient boosting (GB)}, which I'll explain in office hours. Simply treating GB as a black box (for lack of time), has the implementation of it in \texttt{R} called \texttt{xgboost} outperformed the lasso in the (modeling subset fit, validation subset predicted) scenario? Explain briefly.

\end{itemize}

There's lots more to do in this rich-with-ideas case study, but let's stop here for now.

\end{itemize}

\end{document}