#################################################################################
#################################################################################

# exploration in R of the bernoulli likelihood story
# with the full-employment data of quiz 2

# dd (2, 3, 4 feb 2021)

# preliminaries:

# (1) unbuffer the output (if relevant to your R environment)

# (2) set the working directory to the place where you'd like
# PDF versions of plots to be stored

# on my home desktop this is

#   setwd( 'C:/DD/Teaching/AMS-STAT-206/Winter-2021/Bernoulli-Likelihood' )

# (3) ensure that we start with a completely empty R workspace:

rm( list = ls( ) )

#################################################################################

# see the paper-and-pen work on the document camera
# that defines the Bernoulli likelihood function 

#   ell ( theta | y, B ) = c * theta^s * ( 1 - theta )^( n - s ) ,

# in which c can be any positive number you like (and you're allowed
# to move from one value of c to another if it helps you to see
# what's going on)

# i'll initially use c = 1 in what follows

# we're assuming here that the sample size n > 0,
# the number s of 'yes' answers satisfies s > 0,
# the number ( n - s ) of 'not-yes' answers satisfies ( n - s ) > 0,
# and 0 < theta < 1

# the cases n = 0 (no data) or ( theta = 0 or theta = 1: non-probabilistic)
# are uninteresting

# the cases s = 0 and s = n are interesting, but we won't examine them here

#################################################################################

##### examination of the likelihood function

# input the data from quiz 2

n <- 921

s <- 830

# define the bernoulli likelihood function

bernoulli.likelihood <- function( theta, s, n ) {

  likelihood <- theta^s * ( 1 - theta )^( n - s )

  return( likelihood )

}

# plot the bernoulli likelihood function
# with the quiz 2 data

theta.grid.1 <- seq( 0, 1, length = 500 )

plot( theta.grid.1, bernoulli.likelihood( theta.grid.1, s, n ),
  type = 'l', xlab = 'theta', ylab = 'Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

# stop and think about the plot

# let's hold a magnifying glass up to the interesting range of theta
# (where the likelihood is meaningfully positive)

theta.grid.2 <- seq( 0.8, 1, length = 500 )

plot( theta.grid.2, bernoulli.likelihood( theta.grid.2, s, n ),
  type = 'l', xlab = 'theta', ylab = 'Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

# stop again and think about the new plot

# let's continue to magnify the interesting range of theta

theta.grid.3 <- seq( 0.85, 0.95, length = 500 )

plot( theta.grid.3, bernoulli.likelihood( theta.grid.3, s, n ),
  type = 'l', xlab = 'theta', ylab = 'Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

# this is a good plot; i would publish it to myself

# there are two interesting things about this plot; what are they?

# answer 1: the vertical axis involves really small (close to 0) numbers

# in fact, we'll show at the end of this file that
# as s and n get bigger and bigger, the likelihood values
# will eventually get so small that R will report them as 0
# (in computer-science language this is an *underflow* issue)

# but let's not worry about that now; let's look at mr. fisher's story
# in a setting (e.g., quiz 2) in which it works just fine

# one more initial item:

# answer 2: the likelihood function looks a lot like a Normal PDF
# for theta, treating theta as if it were a random variable

# thinking of the likelihood function as an un-normalized 
# (not yet integrating to 1) Normal PDF for theta, 
# as the plot certainly invites us to do, is a bayesian, not frequentist, 
# interpretation of what's going on

# mr. fisher, who was loudly against bayesian methods
# (for a reason i'll give quite soon in lecture),
# wants us to think of this function simply as a function
# without a probabilistic meaning, and he wants to draw inferences
# about the unknown theta solely from this function (or something
# closely related to it; more on that later)

# in other words, he wants (a) to define a *point estimate* 
# theta.hat (a single best guess for theta), (b) compute 
# an approximate frequentist standard error
# for theta.hat, and (c) build an approximate *interval estimate*
# for theta; because of his dispute with mr. neyman,
# he wants to do these things in a non-mr. neyman way

#################################################################################

##### point estimation

# in 1922, mr. fisher offered the world the following suggestion:

#   trust me (says mr. fisher), you'll get a good point estimate
#   of theta by finding the value theta.hat of theta that 

#     *maximizes* the likelihood function ell ( theta | y, B )

# (fisher ra (1922) on the mathematical foundations of theoretical statistics.
# Transactions of the Royal Society of London, 222, 309-368)

# in math language he defines what he calls the 
# *maximum likelihood estimate* (MLE) with the equation

#   theta.hat.mle = argmax_{ theta in Theta } ell ( theta | y, B ) ,

# in which Theta is the *parameter space* (the set of all possible values
# of the unknown (parameter) theta (in this problem Theta = ( 0, 1 ) )

# if the likelihood function has more than one local maximum,
# mr. fisher says: find the global maximum

# ok, let's take mr. fisher's advice:

# question: how do you globally maximize a function like the one
# in the most recent plot?

# answer: well, you can see analytically and graphically that
# the function

#   ell ( theta | y, B ) = theta^s * ( 1 - theta )^( n - s )

# goes to 0 as theta approaches 0 or 1, so the only maximum visible
# in the plot has to be the global max

# so we can simply differentiate the likelihood function
# once with respect to theta and set the first derivative equal to 0

# we can do the this by hand or give it to wolfram alpha at

#   https://www.wolframalpha.com/

# here's a wolfram alpha command that works (remove the hashtag #
# before copying and pasting into wolfram alpha):

#   differentiate theta^s * ( 1 - theta )^( n - s ) in theta

# question: look under 'Roots'; what do you see?

# answer:

# wolfram alpha doesn't know the contextual information here,

#   we're assuming that the sample size n > 0,
#   the number s of 'yes' answers satisfies s > 0,
#   the number ( n - s ) of 'not-yes' answers satisfies ( n - s ) > 0,
#   and 0 < theta < 1 (otherwise the situation is non-probabilistic),

# because we haven't shared that information with it

# sidebar begins below this line

#   this can be remedied with the 'Assuming' command (remove the 
#   hashtags # before copying and pasting into wolfram alpha)

#     Assuming[ n > 0 && s > 0 && ( n - s ) > 0 && theta > 0 && theta < 1, 
#       { differentiate theta^s * ( 1 - theta )^( n - s ) in theta } ]

# sidebar ends above this line

# with the less fancy call to wolfram alpha, we can ignore
# three of the four 'Roots' and focus on the third one,
# using '!=' to stand for 'is not equal to':

#   n != 0, s * ( n - s ) != 0, theta = s / n

# that's the answer we want: the assumptions 'n != 0, s * ( n - s ) != 0'
# match our problem context, so mr. fisher's method here has yielded
# the MLE, which (in R) looks like this:

print( theta.hat.mle <- s / n )

# [1] 0.9011944

# notice that this is extremely intuitively reasonable here;
# our best guess for the population mean theta is the sample mean 

#   y.bar.n  = s / n

# in fact it's the same point estimate that mr. neyman told us to use

abline( v = theta.hat.mle, lwd = 2, lty = 2, col = 'red' )

legend( 0.925, 9e-130, legend = 'MLE', col = 'red', lty = 2 )

# new question: given that as n and s increase we can run into
# underflow problems, is there a simple transformation we can apply
# to the likelihood function that would greatly diminish
# that type of numerical problem?

# answer:

# let's define the *log likelihood function* and plot it:

#   ell ell ( theta | y, B ) = log( ell( theta | y, B ) )

# here after simplification this is

#   ell ell ( theta | y, B ) = s * log( theta ) + ( n - s ) * log( 1 - theta )

# define the bernoulli log likelihood function

bernoulli.log.likelihood <- function( theta, s, n ) {

  log.likelihood <- s * log( theta ) + ( n - s ) * log( 1 - theta )

  return( log.likelihood )

}

# plot the bernoulli log likelihood function
# with the quiz 2 data using the interesting theta grid:

theta.grid.3 <- seq( 0.85, 0.95, length = 500 )

plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ),
  type = 'l', xlab = 'theta', ylab = 'Log Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

# question: what function does that look like to you,
# and why must this be true?

# answer: the likelihood function looks like a Gaussian PDF for theta,
# and the log of a Gaussian PDF is a bowl-shaped-down parabola

# let's line up the likelihood and log likelihood functions
# with the same horizontal axis:

par( mfrow = c( 2, 1 ) )

plot( theta.grid.3, bernoulli.likelihood( theta.grid.3, s, n ),
  type = 'l', xlab = 'theta', ylab = 'Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

abline( v = theta.hat.mle, lwd = 2, lty = 2, col = 'red' )

legend( 0.925, 9e-130, legend = 'MLE', col = 'red', lty = 2 )

plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ),
  type = 'l', xlab = 'theta', ylab = 'Log Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

par( mfrow = c( 1, 1 ) )

# it certainly seems that the maximum of the log likelihood function
# occurs at the same theta value as the maximum of the likelihood function 

# question: is that true in general?

# that idea works here:

par( mfrow = c( 2, 1 ) )

plot( theta.grid.3, bernoulli.likelihood( theta.grid.3, s, n ),
  type = 'l', xlab = 'theta', ylab = 'Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

abline( v = theta.hat.mle, lwd = 2, lty = 2, col = 'red' )

legend( 0.925, 9e-130, legend = 'MLE', col = 'red', lty = 2 )

plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ),
  type = 'l', xlab = 'theta', ylab = 'Log Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

abline( v = theta.hat.mle, lwd = 2, lty = 2, col = 'red' )

legend( 0.87, -305, legend = 'MLE', col = 'red', lty = 2 )

par( mfrow = c( 1, 1 ) )

# and of course it will *always* work: log( x ) is a strictly increasing
# function for x > 0, which means that *IT'S ORDER-PRESERVING*,
# i.e., for all x.1 > 0 and x.2 > 0,

#   if x.1 < x.2 then log( x.1 ) < log( x.2 )

# thus

#   you have a choice in implementing mr. fisher's algorithm:
#   you can either (A) maximize the likelihood function or 
#   (B) maximize the log likelihood function, and the theta that
#   solves problem (A) is the same as the theta that solves problem (B)

# question: which of the { likelihood function, log likelihood function }
# will be easier to maximize analytically?

# answer:

# the argument is in three simple parts: (i) with (conditionally) IID data,
# the likelihood function will always be a product;
# (ii) therefore the log likelihood function will always be a sum;
# and (iii) it's easier to differentiate sums than products

# covered to here on tue 2 feb 2021

#################################################################################

##### uncertainty assessment for the MLE

# ok, we now have a point estimate, theta.hat.mle, of theta
# that mr. fisher says is 'good' (more about that later)

# now we need a give or take (uncertainty assessment) for the MLE

# as noted above, mr. fisher was a dyed-in-the-wool frequentist,
# so he wants to compute or approximate the repeated-sampling 
# estimated variance \hat{ V }_{ IID } ( theta.hat.mle ), 

#   *just from the likelihood or log likelihood function*; 

# he'll then take the square root to get the estimated standard error
# of the MLE:

#   SE.hat.of.theta.hat.mle

# here, of course, the random-variable version of the MLE is S / n,
# in which (under our sampling model) the sum S of the n 
# IID Bernoulli( theta ) random variables follows 
# the Binomial( n, theta ) distribution

# this means that, in this case study, the repeated-sampling variance
# of the random variable MLE, reasoning in mr. neyman's manner, is

#   V_{ IID } ( S / n ) = theta * ( 1 - theta ) / n         

# and we can get a good estimate of this by plugging the MLE for theta
# into this expression everywhere we see a theta:

#   \hat{ V }_{ IID } ( theta.hat.mle ) = 
#     theta.hat.mle * ( 1 - theta.hat.mle ) / n

# --> call this equation (*); we'll refer back to it later

# but (as noted above) mr. fisher wants to get this result

#   *directly from the likelihood or log likelihood function*

# that will make his estimated variance calculation quite general
# and not dependent on the specific form of the MLE

# to make progress we can reason as follows:

# as n goes up the estimated variance of the mle should go down

# in the quiz-2 case study it goes down like O ( 1 / n ) = c / n
# for some positive constant c

# so let's start with the quiz-2 data set (s.old = 830, n.old = 921)
# and try making a new data set in which

#  the sample size is multiplied by 10 -- n.new = 10 * n.old --
#  but the MLE remains the same; this will isolate the effect of n

#  so s.new = 10 * s.old

# i'll now plot the log likelihood functions with these two data sets
# on the same graph, but to make them comparable i'll force them
# to go through the same point ( theta.hat.mle, 0 )

plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ),
  type = 'l', xlab = 'theta', ylab = 'Log Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

abline( v = theta.hat.mle, lwd = 2, lty = 2, col = 'red' )

# at the moment this version of the log likelihood function
# goes approximately through the point ( 830 / 921, - 295 )

# remember that the idea of the likelihood function includes
# a degree of freedom: to get *A* likelihood function,
# we can multiply the joint sampling distribution by any 
# positive constant c that suits us

# this means, since log converts multiplication into addition,
# that we can *ADD* any constant we like to the log likelihood function 
# (shifting it up or down vertically)

# to make a version of the log likelihood function that goes through
# the point ( 0, MLE ), i need to know what's the maximum value
# attained by the log likelihood function

# but this is easy to compute: it has to be the log likelihood function
# evaluated at the MLE:

print( original.data.maximum.log.likelihood.value <- 
  bernoulli.log.likelihood( theta.hat.mle, s, n ) )

# [1] -296.9771

# so now i can create the first part of the plot that i want to make:

plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ) + 
  abs( original.data.maximum.log.likelihood.value ),
  type = 'l', xlab = 'theta', ylab = 'Vertically Shifted Log Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model' )

abline( v = theta.hat.mle, lwd = 3, lty = 3, col = 'red' )

abline( h = 0, lty = 1, lwd = 2, col = 'black' )

# now let's do the same thing with the new data set,
# in which s and n have both been multiplied by 10
# (since the MLE is s / n, this of course leaves the MLE unchanged)

print( new.data.maximum.log.likelihood.value <- 
  bernoulli.log.likelihood( theta.hat.mle, 10 * s, 10 * n ) )

# [1] -2969.771

# question: how does this relate in this case study to
# the previous maximum log likelihood value? why?

# answer: it equals 10 times the old value, because the log likelihood
# function in this case study is additive in s and (n - s )

# question: what should the new plot look like, with both curves
# superimposed? 

# answer: the new log likelihood function will approach the point
# ( theta.hat.mle, 0 ) much more steeply, both from the left
# and from the right

lines( theta.grid.3, 
  bernoulli.log.likelihood( theta.grid.3, 10 * s, 10 * n ) + 
  abs( new.data.maximum.log.likelihood.value ),
  type = 'l', lwd = 3, lty = 2, col = 'blue' )

legend( 0.85, -12.5, legend = c( 'n = 921', 'n = 9210', 'MLE' ), 
  col = c( 'darkcyan', 'blue', 'red' ), lty = c( 1, 2, 3 ) )

# question: how are the green and blue function similar?
# how are they different?

# answers: (similar) same basic shape (concave on the log likelihood scale);
# they pass through the same point ( theta.hat.mle, 0 ); they have the same
# first partial derivative with respect to theta at that point;
# (different) they have sharply different values of

#   - the second partial derivative of the log likelihood function
#   at the MLE

# let's see how this idea plays out on the likelihood scale

# to get likelihood values, i just have to exponentiate
# the log likelihood values in the most recent plot

par( mfrow = c( 2, 1 ) )

plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ) + 
  abs( original.data.maximum.log.likelihood.value ),
  type = 'l', xlab = 'theta', ylab = 'Vertically Shifted Log Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model' )

abline( v = theta.hat.mle, lwd = 3, lty = 3, col = 'red' )

abline( h = 0, lty = 1, lwd = 2, col = 'black' )

lines( theta.grid.3, 
  bernoulli.log.likelihood( theta.grid.3, 10 * s, 10 * n ) + 
  abs( new.data.maximum.log.likelihood.value ),
  type = 'l', lwd = 3, lty = 2, col = 'blue' )

plot( theta.grid.3, exp( bernoulli.log.likelihood( theta.grid.3, s, n ) + 
  abs( original.data.maximum.log.likelihood.value ) ),
  type = 'l', xlab = 'theta', ylab = 'Vertically Scaled Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model' )

abline( v = theta.hat.mle, lwd = 3, lty = 3, col = 'red' )

lines( theta.grid.3, 
  exp( bernoulli.log.likelihood( theta.grid.3, 10 * s, 10 * n ) + 
  abs( new.data.maximum.log.likelihood.value ) ),
  type = 'l', lwd = 3, lty = 2, col = 'blue' )

par( mfrow = c( 1, 1 ) )

# ok, so what have we learned?

# as n increases, holding the MLE constant, 

#   minus the second partial derivative of the log likelihood function
#   with respect to theta, evaluated at the MLE, goes *UP*;

#   our inferential *information* about theta has gone *UP*; and therefore

#   our inferential *uncertainty* about theta, as measured by the
#   repeated-sampling variance of the (random-variable version of the) MLE
#   goes *DOWN*

# by a different argument (involving a taylor expansions of the
# log likelihood function) mr. fisher drew the same conclusion
# in considerably more generality than in our examination of
# the quiz 2 case study: he defined

#   the *information* I ( theta.hat.mle ) = minus the second partial derivative 
#   of the log likelihood function with respect to theta, evaluated at 
#   theta = theta.hat.mle

# for a look at the taylor-series story, you could study
# the reply by Deep North at stackexchange:

#   https://stats.stackexchange.com/questions/174600/help-with-taylor-expansion-of-log-likelihood-function

# mr. fisher had a huge ego, so he wanted people to call this quantity
# *fisher information*, but he knew that by academic convention
# you personally can't name one of your ideas after yourself,
# but he didn't have long to wait before people started calling it
# fisher information

# let's compute the information symbolically here, with wolfram alpha:

#   second derivative of - ( s * log( theta ) + ( n - s ) * log( 1 - theta ) ) with respect to theta

#     ( n - s ) / ( 1 - theta )^2 + s / theta^2

#   evaluate ( n - s ) / ( 1 - theta )^2 + s / theta^2 at theta = s / n

#     n^3 / ( n * s - s^2 )

# we want this expression in terms of the MLE theta.hat.mle = s / n

# divide both numerator and denominator by n^2 to get

#     n / ( ( s / n ) - ( s / n )^2 ) = 
#       n / ( theta.hat.mle - theta.hat.mle^2 ) =
#       n / ( theta.hat.mle * ( 1 - theta.hat.mle ) )

# this is the information about theta in this sampling model:

#   I ( theta.hat.mle ) = n / ( theta.hat.mle * ( 1 - theta.hat.mle ) )

# question: what happens to the information as the sample size increases?

# answer: it goes up, at a rate proportional to n

# in fact, and this is a rather general result under regularity conditions,
# with an IID sample of size n from a parametric family of sampling
# distributions, 

#   I ( theta.hat.mle ) = O ( n )

# this suggests that

#   I ( theta.hat.mle based on an IID sample of size n ) =
#     n * I ( theta.hat.mle based on an IID sample of size 1 ) ,

# which is also true rather generally

# sidebar begins below this line

# the main regularity condition needed for the last two results to hold
# is that

#   the support of the IID random variables Y_i does not depend
#   in any way on the unknown parameter

# in my versions of stat 131, i sometimes set a problem on
# one of the take-home tests in which the sampling model is

#   ( Y_i | theta B ) ~IID Uniform( 0, theta )

# this violates the regularity condition mentioned above,
# and (remarkably) the new result for this model is

#   I ( theta.hat.mle ) = O ( n^2 )

# sidebar ends above this line

# returning to the result in the IID Bernoulli sampling model,

#   I ( theta.hat.mle ) = n / ( theta.hat.mle * ( 1 - theta.hat.mle ) ) ,

# mr. fisher wants an estimated repeated-sampling variance estimate
# for theta.hat.mle; but

#   estimated variance of theta.hat.mle should go *down* as n increases,
#   whereas we've just seen that the fisher information goes *up*
#   as n grows; in other words,

#   the estimated variance of theta.hat.mle and I ( theta.hat.mle )
#   have an inverse relationship: as one goes down, the other one goes up

# question: what's the simplest form of inverse relationship?

# answer: y = 1 / x

# now look back at the right answer for the 
# estimated variance of theta.hat.mle in this case study,
# from the Bernoulli or Binomial sampling distributions ((*) above)

#   \hat{ V }_{ IID } ( theta.hat.mle ) = 
#     theta.hat.mle * ( 1 - theta.hat.mle ) / n

# from above,

#   I ( theta.hat.mle ) = n / ( theta.hat.mle * ( 1 - theta.hat.mle ) )

# this strongly suggests the conjecture

#   estimated variance of theta.hat.mle = [ I ( theta.hat.mle ) ]^( -1 ) ,

# which again turns out to be true rather generally

# this means that mr. fisher has offered us a quite general recipe,
# not only for good point estimates but also for good uncertainty bands
# around those estimates:

#   \hat{ V } ( theta.hat.mle ) = [ I ( theta.hat.mle ) ]^( -1 ) ,

# leading immediately to

#   SE.hat ( theta.hat.mle ) = sqrt{ [ I ( theta.hat.mle ) ]^( -1 ) }

# mr. fisher also conjectured in his 1922 paper that MLEs were
# the best possible estimators in a given sampling model,
# which turns out to essentially be correct

# sidebar begins below this line

# there's a later (around 1945) math result called the 
# *cramer-rao lower bound* that proves that mr. fisher's conjecture 
# was correct in a sense that's compelling from the frequentist
# point of view:

#   theorem (cramer-rao, informal statement): among all unbiased
#   estimators of an unknown parameter, the MLE achieves the
#   smallest possible repeated-sampling variance

# (as usual, the name of this result is partially wrong;
# at almost the exact same moment, the following people 
# (in alphabetical order) all published papers (working independently) 
# with this same result:

#   aitken, alexander
#   cramer, harald
#   darmois, george
#   frechet, maurice
#   rao, cr
#   silverstone, harold 

# sidebar ends above this line

# it turns out that MLEs are not in general unbiased, but
# they're approximately unbiased, and the bias decreases quickly
# as n increases:

#   E_{ IID } ( theta.hat.mle ) = theta + O ( 1 / n )

# and finally, mr. fisher also has a central limit theorem for his
# maximum likelihood estimators:

#   for large n the repeated-sampling distribution of theta.hat.mle
#   is approximately Normal with mean theta and standard error

#     SE.hat ( theta.hat.mle ) = sqrt{ [ I ( theta.hat.mle ) ]^( -1 ) }

# because of mr. fisher's antipathy toward mr. neyman, mr. fisher
# was reluctant to commit to mr. neyman's confidence interval (CI) idea 
# in print, but practitioners quickly combined the two approaches 
# and routinely computed maximum-likelihood-based CIs, which is
# a perfectly valid thing to do:

#   an approximate (large-n) 100 ( 1 - alpha )% confidence interval for theta
#   from mr. fisher's likelihood approach is

#     theta.hat.mle +/- 
#       Phi^( -1 ) ( 1 - alpha / 2 ) * sqrt{ [ I ( theta.hat.mle ) ]^( -1 ) } ,

#   in which

#     I ( theta.hat.mle ) = minus the second partial derivative
#       of the log likelihood function with respect to thete, 
#       evaluated at theta = theta.hat.mle

# this was an outstanding technology for the year 1922, and it's still
# used routinely today, both statistical and machine-learning data science

# in the quiz 2 full-employment case study, the two different-looking
# approaches to inference about theta offered by mr. neyman and mr. fisher
# yield identical results

# the advantage of mr. fisher's method is that he has no uncertainty
# about the 'best' estimator to use for the unknown parameter theta,
# namely the MLE; mr. neyman preferred to focus on *unbiased* estimators,
# which don't always exist and which sometimes produce silly
# (logically-internally-inconsistent) results

# sidebar begins below this line

#   i can show you a (slightly complicated) sampling model in which
#   we're interested in estimating a variance sigma^2,
#   which of course must be non-negative; in this model the unbiased
#   estimate of sigma^2 can come out negative

#   in general unbiasedness is a reasonable criterion when the
#   *parameter space* (the set of possible values of an unknown parameter)
#   is unbounded; if bounds exist (e.g., sigma^2 >= 0 above),
#   unbiased estimators can sometimes disobey the bound(s)

#   MLEs and bayesian estimators *always* correctly respect
#   the boundaries of the parameter space

# sidebar ends above this line

# covered up through here on 3 feb 2021

#################################################################################

##### stress test: could anything go wrong with the story we developed above?

# we saw that with ( s, n ) = ( 830, 921 ), the likelihood values
# on the vertical axis were quite small: on the order of 10^( -130 )

# let's try a stress test: multiply both n and s by 10
# and try plotting the likelihood function

n.stress.test <- 9210

s.stress.test <- 8300

plot( theta.grid.3, bernoulli.likelihood( theta.grid.3, s.stress.test, 
  n.stress.test ), type = 'l', xlab = 'theta', 
  ylab = 'Bernoulli Likelihood', lwd = 3, col = 'darkcyan' )

# question: what just happened?

# answer: this is numerical underflow at work: R can't handle numbers 
# closer to 0 than x; what's x?

.Machine

#   .
#   .

# $double.xmin
# [1] 2.225074e-308

#   .
#   .

# R is conforming to the IEEE 754 technical standard for
# floating point computation: anything of type 'numeric' 
# closer to 0 than about 2 * 10^( - 308 ) is rounded down to 0

# it turns out that the vanilla versions of wolfram alpha, 
# mathematica and maple cannot handle this either

# and neither can the CRAN R packages 'Ryacas', 'Rmpfr', and 'gmp',
# with the wonderfully-named function 'Brobdingnag'

# even if you could get an R function to compute a number
# closer to 0 than about 2 * 10^( - 308 ), the 'plot' function
# would turn it into a 0 anyway

# we've already figured out earlier how to solve this problem:

#   (1) compute the MLE theta.hat.mle = s / n

#   (2) evaluate the log likelihood function at the MLE;
#       call this number ell ell ( theta.hat.mle, s, n );
#       it's the maximum value of the log likelihood function

#   (3) add the absolute value of ell ell ( theta.hat.mle, s, n )
#       to the log likelihood function and exponentiate
#       to get a version of the likelihood function
#       that's *been scaled* so that it passes through the point
#       ( theta.hat.mle, 1 ), and plot that

print( theta.hat.mle.stress.test <- s.stress.test / n.stress.test )

# [1] 0.9011944

print( maximum.log.likelihood.value.stress.test <- 
  bernoulli.log.likelihood( theta.hat.mle.stress.test, s.stress.test,
  n.stress.test ) )

# [1] -2969.771

plot( theta.grid.3, exp( bernoulli.log.likelihood( theta.grid.3, 
  s.stress.test, n.stress.test ) + 
  abs( maximum.log.likelihood.value.stress.test ) ), 
  type = 'l', xlab = 'theta', ylab = 'Scaled Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 8300, n = 9210 )' )

# let's zero in on the (narrower) interesting region
# with these values of s and n

theta.grid.4 <- seq( 0.885, 0.915, length = 500 )

plot( theta.grid.4, exp( bernoulli.log.likelihood( theta.grid.4, 
  s.stress.test, n.stress.test ) + 
  abs( maximum.log.likelihood.value.stress.test ) ), 
  type = 'l', xlab = 'theta', ylab = 'Scaled Likelihood',
  lwd = 3, col = 'darkcyan', 
  main = 'Bernoulli sampling model ( s = 8300, n = 9210 )' )

# as noted above, and as we'll see soon in lecture,
# the bayesian interpretation of this plot is to regard it
# as a scaled version of an approximately Normal PDF for theta, 
# which turns out to be a scaled version of the posterior distribution 
# for theta with a Uniform( 0, 1 ) prior for theta (and we'll soon see 
# that this is a *low-information* prior)

# so let's superimpose a scaled Normal PDF on the scaled likelihood function
# and compare

# question: which Normal distribution? (i.e., what should we use
# for the mean and SD)

# answer: we use theta.hat.mle as the mean, and we use
# the likelihood-based estimated SE (based on information) as the SD

print( se.hat.theta.hat.mle.stress.test <- sqrt( theta.hat.mle.stress.test *
  ( 1 - theta.hat.mle.stress.test ) / n.stress.test ) )

# [1] 0.003109355

# the usual vertical scale factor for the Normal( mu, sigma^2 ) distribution
# is (of course) 1 / ( sigma * sqrt( 2 * pi ) )

# so i'll multiply the usual Normal PDF by ( sigma * sqrt( 2 * pi ) )
# to get it to go through the point ( theta.hat.mle.stress.test, 1 )

lines( theta.grid.4, se.hat.theta.hat.mle.stress.test * sqrt( 2 * pi ) *
  dnorm( theta.grid.4, theta.hat.mle.stress.test, 
  se.hat.theta.hat.mle.stress.test ), lty = 2, lwd = 3, col = 'blue' )

# this illustrates an important general result:

#   when n is large and the amount of information about theta external
#   to the data vector y = ( y_1, ..., y_n ) is small,

#     maximum likelihood = approximate bayes

# this is an informal statement of a powerful result
# known as the *bernstein-von mises theorem*

# the bayesian inferential story is much easier to interpret than 
# the frequentist inferential story: with bayes we can make
# direct probability statements about the unknown theta,
# whereas with frequency we cannot

# example: in the quiz 2 full-employment case study, mr. neyman's
# 99.9% confidence interval went from ( 86.9, 93.4 )%, and
# mr. fisher's likelihood inference gets exactly the same thing, but

#   P_F ( 0.869 < theta < 0.934 ) is not equal to 0.999, it's undefined

# whereas, if we have little information about theta external to
# the quiz-2 data set, 

#   P_B ( 0.869 < theta < 0.934 | y, low-information prior, B )
#     *IS* approximately equal to 0.999

# and (as we'll soon see) the bayesian story requires computing or
# approximating nasty integrals, i.e.,

#   maximum likelihood, which is based on *differentiation*,
#   is a lot easier to implement than bayes, which is based on *integration*

# this is why mr. fisher's technology was so perfect for 1922,
# and for many decades thereafter: even if someone wanted to
# get bayesian results in 1922, it's highly possible that they
# didn't know how to accurately approximate the nasty integrals
# (and this remained true up to about 1980)

# so we can eat our cake and have it too:

#   when n is large and the amount of information about theta external
#   to the data vector y = ( y_1, ..., y_n ) is small,

#     we can make our calculations using maximum likelihood
#     and interpret the results in a bayesian way

#################################################################################

##### functional invariance of the MLE

# here's another important feature of mr. fisher's MLE algorithm:

# suppose that you've found the MLE theta.hat.mle for a 
# (one-dimensional) parameter theta

# now you get interested in a new quantity 

#   eta = g ( theta ) ,

# where g: R -> R is a 'nice' function; here 'nice' means *invertible*

# eta = g ( theta ) is called a *transformed version* of theta;
# since g( . ) is invertible, eta and theta carry exactly the same
# information, so going from theta to eta is also called
# *reparameterizing* the sampling model

# it would certainly be great if the MLE of g ( theta ) was just
# g( . ) evaluated at the MLE of theta, i.e., if

#   eta.hat.mle = \hat{ g ( theta ) } = g ( \hat{ theta } ) = g( theta.hat.mle )

# and this turns out to be true; it's called the 

#   *functional invariance property* of maximum likelihood

# example in the quiz 2 case study: theta is a proportion or probability
# in favor of the truth of a true/false proposition A, satisfying 0 < theta < 1

# later in the course we'll see that from a modeling perspective
# a more (technically) natural parameter to be interested in, 
# in this situation, is

#   eta = g ( theta ) = log ( theta / ( 1 - theta ) ) = 
#     the log odds ratio in favor of A being true

# this is also called the *logit* transformation:

#   logit( theta ) = log ( theta / ( 1 - theta ) )

# let's look at this function

logit <- function( theta ) {

  return( log ( theta / ( 1 - theta ) ) )

}

# this function is properly vectorized

# note that logit( theta ) goes to ( - infinity ) as theta decreases to 0
# and goes to + infinity as theta increases to 1

# so let's avoid 0 and 1

theta.grid.4 <- seq( 0.0001, 0.9999, length = 500 )

plot( theta.grid.4, logit( theta.grid.4 ), type = 'l',
  lty = 1, lwd = 3, col = 'darkcyan', xlab = 'theta',
  ylab = 'logit ( theta )' )

# we could *reparameterize* the likelihood function in terms of eta
# and go through the maximization process, but we don't need to:
# g( theta ) is invertible, so

#   the MLE of eta = eta.hat.mle = logit( theta.hat.mle )
#     = log( theta.hat.mle / ( 1 - theta.hat.mle ) )

# it's not hard to show that the inverse transformation (anti-logit)
# that undoes the logit is

#   eta = log( theta / ( 1 - theta ) ) iff theta = 1 / ( 1 + exp ( - eta ) )

summary( logit( theta.grid.4 ) )

#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# -9.210  -1.098   0.000   0.000   1.098   9.210

anti.logit <- function( eta ) {

  return( 1 / ( 1 + exp ( - eta ) ) )

}

eta.grid.1 <- seq( min( logit( theta.grid.4 ) ), 
  max( logit( theta.grid.4 ) ), length = 500 )

plot( eta.grid.1, anti.logit( eta.grid.1 ), type = 'l',
  lty = 1, lwd = 3, col = 'darkcyan', xlab = 'eta',
  ylab = 'anti-logit ( eta )' )

# this looks like a CDF, and in fact it *IS*; it's the CDF of the
# *(standard) logistic distribution*, with PDF

#   Y ~ Logistic( eta ) iff 
#     p_Y ( eta ) = exp( - eta ) / ( 1 + exp( - eta ) )^2
#                 = sech^2 ( eta / 2 ) / 4

# (the last expression for fans of hyperbolic functions)

standard.logistic.pdf <- function( eta ) {

  pdf <- exp( - eta ) / ( 1 + exp( - eta ) )^2

  return( pdf )

}

plot( eta.grid.1, standard.logistic.pdf( eta.grid.1 ), type = 'l',
  lty = 1, lwd = 3, col = 'darkcyan', xlab = 'eta',
  ylab = 'Standard Logistic PDF' )

# this PDF looks rather like either a Normal or a t distribution

lines( eta.grid.1, dnorm( eta.grid.1 ), lty = 2, lwd = 3, col = 'red' )

# need to re-plot with a larger upper limit for the vertical axis:

# the mode of the standard normal distribution has height 

1 / sqrt( 2 * pi )

# [1] 0.3989423

plot( eta.grid.1, standard.logistic.pdf( eta.grid.1 ), type = 'l',
  lty = 1, lwd = 3, col = 'darkcyan', xlab = 'eta',
  ylab = 'PDF', ylim = c( 0, 0.4 ) )

# this PDF looks rather like either a Normal or a t distribution
# (the *Cauchy* distribution is the same as the t distribution
# with 1 degree of freedom)

lines( eta.grid.1, dnorm( eta.grid.1 ), lty = 2, lwd = 3, col = 'red' )

lines( eta.grid.1, dt( eta.grid.1, 1 ), lty = 3, lwd = 3, col = 'blue' )

legend( 4.0, 0.35, legend = c( 'Logistic', 'Normal', 'Cauchy' ), 
  col = c( 'darkcyan', 'red', 'blue' ), lty = 1:3, lwd = 3 )

#################################################################################

##### sufficiency; sufficient statistics

# one more important idea in the likelihood story
# that's also important in the bayesian story

# look at the form of the likelihood function in the quiz 2 case study:

#   ell ( theta | y, B ) = c * theta^s * ( 1 - theta )^( n - s )

# notice that y appears on the left-hand side of this expression
# but not on the right 

# question: why not?

# answer: we initially took y and computed its sum s;
# we could equivalently write

#   ell ( theta | y, B ) = ell( theta | s, n, B ) = ell( theta | s, B )
#     c * theta^s * ( 1 - theta )^( n - s )

# this situation is interesting: somebody offers us the entire data vector
# y = ( y_1, ..., y_n ), and to evaluate the likelihood function we
# perform a dimensionality-reduction on the data set *down from n to 1*
# (not counting n itself; by convention the sample size is not counted
# in this list of summaries of y) by calculating the sum s of the y_i values; 
# with s in hand, we no longer need the original y vector to evaluate 
# the likelihood function

# in his 1922 paper mr. fisher called this property *sufficiency*;
# he would say that *s is sufficient (or s is a *sufficient statistic*) 
# for drawing inferences about theta in the 
# IID Bernoulli( theta ) sampling model*

# as noted above, this idea is equally important to bayesians

# there's a frequentist definition of sufficiency and a different, simpler
# bayesian definition, which is the one we'll use in this class:

#   (bayesian definition of sufficiency) a function ss( y ) of the data vector
#   is *sufficient* for the unknown parameter theta iff the likelihood function
#   depends on y only through ss( y )

# question: if a sampling model has a sufficient statistic, is it unique?

# answer: NO; for example, 

# ss.1 ( y ) = sum of ( y_1, ..., y_n )           (1-dimensional) (best)

# ss.2 ( y ) = ( number of 0s, number of 1s )     (2-dimensional)

# ss.3 ( y ) = ( y_1, sum of ( y_2, ..., y_n ) )  (2-dimensional)

# ss.4 ( y ) = y itself                           (n-dimensional) 

# actually, there are always lots of sufficient statistics; for example,
# according to the definition, the data vector y is itself sufficient

# in the quiz 2 context, the 2-dimensional summary 

#   ss.2 = ( y_1, sum of y_2, ..., y_n )

# is also sufficient

# from a data-dimensionality-reduction point of view, ss.1 = s (reduction
# from n to 1) is better than ss.2 (reduction from n to 2)

# mr. fisher understood this; he asked himself, is there a *minimal*
# level of data-dimensionality-reduction below which we can't go
# and still preserve sufficiency? 

# he made another definition to help explore the answer to this question;
# again, there's a formal frequentist definition, but we'll work with
# an informal version:

#   (informal definition of *minimal sufficient statistic*) a sufficient
#   statistic ss is *minimal sufficient* if, starting with ss, 
#   no further data dimensionality reduction is possible while
#   preserving sufficiency

# here ss.1 = sum of ( y_1, ..., y_n ) is both sufficient and minimal sufficient

# in the quiz 2 setting, note that (a) the vector theta of unknowns is
# a scalar (a vector of length 1), so that the number of unknown things
# is 1, and (b) the dimensionality of ( s = the sum of the y_i values )
# is also 1

# this is the usual (regular) case: 

#   if you have k unknowns in your vector theta = ( theta_1, ..., theta_k ), 
#   for k a finite integer >= 1, and if a minimal sufficient statistic exists, 
#   it will usually also be of dimension k

# an example that we'll look at soon: 

#   ( Y_i | mu, sigma, B ) ~IID N ( mu, sigma^2 ) with mu and sigma unknown,
#                   ( i = 1, ..., n )

# Y = ( Y_1, ..., Y_n ) and y = ( y_1, ..., y_n ) (both of dimension n) 

# now

#   theta = ( mu, sigma ) or theta = ( mu, sigma^2 ) (dimension k = 2)

#   Theta = ( - infinity, + infinity ) x ( 0, + infinity )

# all minimal sufficient statistics in this model will have dimension k = 2

# here are two minimal sufficient statistics in this model:

#   ss.1 ( y ) = ( sum of y, sum of y^2 )

#   ss.2 ( y ) = ( sample mean of y, sample variance of y )

# and n is always implied as another summary of y that's necessary
# to evaluate the likelihood function

#################################################################################
#################################################################################