################################################################################# ################################################################################# # exploration in R of the bernoulli likelihood story # with the full-employment data of quiz 2 # dd (2, 3, 4 feb 2021) # preliminaries: # (1) unbuffer the output (if relevant to your R environment) # (2) set the working directory to the place where you'd like # PDF versions of plots to be stored # on my home desktop this is # setwd( 'C:/DD/Teaching/AMS-STAT-206/Winter-2021/Bernoulli-Likelihood' ) # (3) ensure that we start with a completely empty R workspace: rm( list = ls( ) ) ################################################################################# # see the paper-and-pen work on the document camera # that defines the Bernoulli likelihood function # ell ( theta | y, B ) = c * theta^s * ( 1 - theta )^( n - s ) , # in which c can be any positive number you like (and you're allowed # to move from one value of c to another if it helps you to see # what's going on) # i'll initially use c = 1 in what follows # we're assuming here that the sample size n > 0, # the number s of 'yes' answers satisfies s > 0, # the number ( n - s ) of 'not-yes' answers satisfies ( n - s ) > 0, # and 0 < theta < 1 # the cases n = 0 (no data) or ( theta = 0 or theta = 1: non-probabilistic) # are uninteresting # the cases s = 0 and s = n are interesting, but we won't examine them here ################################################################################# ##### examination of the likelihood function # input the data from quiz 2 n <- 921 s <- 830 # define the bernoulli likelihood function bernoulli.likelihood <- function( theta, s, n ) { likelihood <- theta^s * ( 1 - theta )^( n - s ) return( likelihood ) } # plot the bernoulli likelihood function # with the quiz 2 data theta.grid.1 <- seq( 0, 1, length = 500 ) plot( theta.grid.1, bernoulli.likelihood( theta.grid.1, s, n ), type = 'l', xlab = 'theta', ylab = 
'Likelihood', lwd = 3, col = 'darkcyan', main = 'Bernoulli sampling model ( s = 830, n = 921 )' ) # stop and think about the plot # let's hold a magnifying glass up to the interesting range of theta # (where the likelihood is meaningfully positive) theta.grid.2 <- seq( 0.8, 1, length = 500 ) plot( theta.grid.2, bernoulli.likelihood( theta.grid.2, s, n ), type = 'l', xlab = 'theta', ylab = 'Likelihood', lwd = 3, col = 'darkcyan', main = 'Bernoulli sampling model ( s = 830, n = 921 )' ) # stop again and think about the new plot # let's continue to magnify the interesting range of theta theta.grid.3 <- seq( 0.85, 0.95, length = 500 ) plot( theta.grid.3, bernoulli.likelihood( theta.grid.3, s, n ), type = 'l', xlab = 'theta', ylab = 'Likelihood', lwd = 3, col = 'darkcyan', main = 'Bernoulli sampling model ( s = 830, n = 921 )' ) # this is a good plot; i would publish it to myself # there are two interesting things about this plot; what are they? # answer 1: the vertical axis involves really small (close to 0) numbers # in fact, we'll show at the end of this file that # as s and n get bigger and bigger, the likelihood values # will eventually get so small that R will report them as 0 # (in computer-science language this is an *underflow* issue) # but let's not worry about that now; let's look at mr. fisher's story # in a setting (e.g., quiz 2) in which it works just fine # one more initial item: # answer 2: the likelihood function looks a lot like a Normal PDF # for theta, treating theta as if it were a random variable # thinking of the likelihood function as an un-normalized # (not yet integrating to 1) Normal PDF for theta, # as the plot certainly invites us to do, is a bayesian, not frequentist, # interpretation of what's going on # mr. 
fisher, who was loudly against bayesian methods
# (for a reason i'll give quite soon in lecture),
# wants us to think of this function simply as a function
# without a probabilistic meaning, and he wants to draw inferences
# about the unknown theta solely from this function (or something
# closely related to it; more on that later)

# in other words, he wants (a) to define a *point estimate*
# theta.hat (a single best guess for theta), (b) to compute
# an approximate frequentist standard error
# for theta.hat, and (c) to build an approximate *interval estimate*
# for theta; because of his dispute with mr. neyman,
# he wants to do these things in a non-mr. neyman way

#################################################################################

##### point estimation

# in 1922, mr. fisher offered the world the following suggestion:

# trust me (says mr. fisher), you'll get a good point estimate
# of theta by finding the value theta.hat of theta that
# *maximizes* the likelihood function ell ( theta | y, B )

# (fisher ra (1922). on the mathematical foundations of theoretical statistics.
# Philosophical Transactions of the Royal Society of London, Series A,
# 222, 309-368)

# in math language he defines what he calls the
# *maximum likelihood estimate* (MLE) with the equation

# theta.hat.mle = argmax_{ theta in Theta } ell ( theta | y, B ) ,

# in which Theta is the *parameter space* (the set of all possible values
# of the unknown parameter theta); in this problem Theta = ( 0, 1 )

# if the likelihood function has more than one local maximum,
# mr. fisher says: find the global maximum

# ok, let's take mr. fisher's advice:

# question: how do you globally maximize a function like the one
# in the most recent plot?
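# before answering analytically, here's a quick numerical check (a sketch
# added for illustration, not part of the original derivation): R's
# built-in one-dimensional optimizer 'optimize' can maximize the
# likelihood directly over ( 0, 1 ); the object name 'numerical.mle'
# is new, and the data and likelihood definitions simply repeat the
# ones above so that this chunk is self-contained

```r
# numerically maximize the bernoulli likelihood over ( 0, 1 )
# (repeats the quiz-2 data and likelihood function from above)

n <- 921
s <- 830

bernoulli.likelihood <- function( theta, s, n ) {
  return( theta^s * ( 1 - theta )^( n - s ) )
}

# 'optimize' performs a golden-section / parabolic-interpolation search;
# with maximum = TRUE it returns the maximizing theta in $maximum

numerical.mle <- optimize( bernoulli.likelihood, interval = c( 0, 1 ),
  s = s, n = n, maximum = TRUE )

print( numerical.mle$maximum )

# this should agree with the analytic answer developed below,
# s / n = 0.9011944, to within optimize's default tolerance
```

# a numerical maximizer is a useful sanity check, but (as we'll see)
# the analytic route gives the answer in closed form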
# answer: well, you can see analytically and graphically that
# the function

# ell ( theta | y, B ) = theta^s * ( 1 - theta )^( n - s )

# goes to 0 as theta approaches 0 or 1, so the only maximum visible
# in the plot has to be the global max

# so we can simply differentiate the likelihood function
# once with respect to theta and set the first derivative equal to 0

# we can do this by hand or give it to wolfram alpha at

# https://www.wolframalpha.com/

# here's a wolfram alpha command that works (remove the hashtag #
# before copying and pasting into wolfram alpha):

# differentiate theta^s * ( 1 - theta )^( n - s ) in theta

# question: look under 'Roots'; what do you see?

# answer:

# wolfram alpha doesn't know the contextual information here:
# we're assuming that the sample size n > 0,
# the number s of 'yes' answers satisfies s > 0,
# the number ( n - s ) of 'not-yes' answers satisfies ( n - s ) > 0,
# and 0 < theta < 1 (otherwise the situation is non-probabilistic),
# because we haven't shared that information with it

# sidebar begins below this line

# this can be remedied with the 'Assuming' command (remove the
# hashtags # before copying and pasting into wolfram alpha)

# Assuming[ n > 0 && s > 0 && ( n - s ) > 0 && theta > 0 && theta < 1,
# { differentiate theta^s * ( 1 - theta )^( n - s ) in theta } ]

# sidebar ends above this line

# with the less fancy call to wolfram alpha, we can ignore
# three of the four 'Roots' and focus on the third one,
# using '!=' to stand for 'is not equal to':

# n != 0, s * ( n - s ) != 0, theta = s / n

# that's the answer we want: the assumptions 'n != 0, s * ( n - s ) != 0'
# match our problem context, so mr. fisher's method here has yielded
# the MLE, which (in R) looks like this:

print( theta.hat.mle <- s / n )

# [1] 0.9011944

# notice that this is extremely intuitively reasonable here;
# our best guess for the population mean theta is the sample mean
# y.bar.n = s / n

# in fact it's the same point estimate that mr.
neyman told us to use abline( v = theta.hat.mle, lwd = 2, lty = 2, col = 'red' ) legend( 0.925, 9e-130, legend = 'MLE', col = 'red', lty = 2 ) # new question: given that as n and s increase we can run into # underflow problems, is there a simple transformation we can apply # to the likelihood function that would greatly diminish # that type of numerical problem? # answer: # let's define the *log likelihood function* and plot it: # ell ell ( theta | y, B ) = log( ell( theta | y, B ) ) # here after simplification this is # ell ell ( theta | y, B ) = s * log( theta ) + ( n - s ) * log( 1 - theta ) # define the bernoulli log likelihood function bernoulli.log.likelihood <- function( theta, s, n ) { log.likelihood <- s * log( theta ) + ( n - s ) * log( 1 - theta ) return( log.likelihood ) } # plot the bernoulli log likelihood function # with the quiz 2 data using the interesting theta grid: theta.grid.3 <- seq( 0.85, 0.95, length = 500 ) plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ), type = 'l', xlab = 'theta', ylab = 'Log Likelihood', lwd = 3, col = 'darkcyan', main = 'Bernoulli sampling model ( s = 830, n = 921 )' ) # question: what function does that look like to you, # and why must this be true? 
# answer: the likelihood function looks like a Gaussian PDF for theta,
# and the log of a Gaussian PDF is a downward-opening parabola

# let's line up the likelihood and log likelihood functions
# with the same horizontal axis:

par( mfrow = c( 2, 1 ) )

plot( theta.grid.3, bernoulli.likelihood( theta.grid.3, s, n ), type = 'l',
  xlab = 'theta', ylab = 'Likelihood', lwd = 3, col = 'darkcyan',
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

abline( v = theta.hat.mle, lwd = 2, lty = 2, col = 'red' )

legend( 0.925, 9e-130, legend = 'MLE', col = 'red', lty = 2 )

plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ), type = 'l',
  xlab = 'theta', ylab = 'Log Likelihood', lwd = 3, col = 'darkcyan',
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

par( mfrow = c( 1, 1 ) )

# it certainly seems that the maximum of the log likelihood function
# occurs at the same theta value as the maximum of the likelihood function

# question: is that true in general?

# before answering, let's at least verify the idea here,
# by marking the MLE on both panels:

par( mfrow = c( 2, 1 ) )

plot( theta.grid.3, bernoulli.likelihood( theta.grid.3, s, n ), type = 'l',
  xlab = 'theta', ylab = 'Likelihood', lwd = 3, col = 'darkcyan',
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

abline( v = theta.hat.mle, lwd = 2, lty = 2, col = 'red' )

legend( 0.925, 9e-130, legend = 'MLE', col = 'red', lty = 2 )

plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ), type = 'l',
  xlab = 'theta', ylab = 'Log Likelihood', lwd = 3, col = 'darkcyan',
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

abline( v = theta.hat.mle, lwd = 2, lty = 2, col = 'red' )

legend( 0.87, -305, legend = 'MLE', col = 'red', lty = 2 )

par( mfrow = c( 1, 1 ) )

# and of course it will *always* work: log( x ) is a strictly increasing
# function for x > 0, which means that *IT'S ORDER-PRESERVING*,
# i.e., for all x.1 > 0 and x.2 > 0,
# if x.1 < x.2 then log( x.1 ) < log( x.2 )

# thus
# you have a choice in implementing mr.
fisher's algorithm: # you can either (A) maximize the likelihood function or # (B) maximize the log likelihood function, and the theta that # solves problem (A) is the same as the theta that solves problem (B) # question: which of the { likelihood function, log likelihood function } # will be easier to maximize analytically? # answer: # the argument is in three simple parts: (i) with (conditionally) IID data, # the likelihood function will always be a product; # (ii) therefore the log likelihood function will always be a sum; # and (iii) it's easier to differentiate sums than products # covered to here on tue 2 feb 2021 ################################################################################# ##### uncertainty assessment for the MLE # ok, we now have a point estimate, theta.hat.mle, of theta # that mr. fisher says is 'good' (more about that later) # now we need a give or take (uncertainty assessment) for the MLE # as noted above, mr. fisher was a dyed-in-the-wool frequentist, # so he wants to compute or approximate the repeated-sampling # estimated variance \hat{ V }_{ IID } ( theta.hat.mle ), # *just from the likelihood or log likelihood function*; # he'll then take the square root to get the estimated standard error # of the MLE: # SE.hat.of.theta.hat.mle # here, of course, the random-variable version of the MLE is S / n, # in which (under our sampling model) the sum S of the n # IID Bernoulli( theta ) random variables follows # the Binomial( n, theta ) distribution # this means that, in this case study, the repeated-sampling variance # of the random variable MLE, reasoning in mr. neyman's manner, is # V_{ IID } ( S / n ) = theta * ( 1 - theta ) / n # and we can get a good estimate of this by plugging the MLE for theta # into this expression everywhere we see a theta: # \hat{ V }_{ IID } ( theta.hat.mle ) = # theta.hat.mle * ( 1 - theta.hat.mle ) / n # --> call this equation (*); we'll refer back to it later # but (as noted above) mr. 
fisher wants to get this result
# *directly from the likelihood or log likelihood function*

# that will make his estimated variance calculation quite general
# and not dependent on the specific form of the MLE

# to make progress we can reason as follows:

# as n goes up the estimated variance of the mle should go down

# in the quiz-2 case study it goes down like O ( 1 / n ) = c / n
# for some positive constant c

# so let's start with the quiz-2 data set (s.old = 830, n.old = 921)
# and try making a new data set in which
# the sample size is multiplied by 10 -- n.new = 10 * n.old --
# but the MLE remains the same; this will isolate the effect of n

# so s.new = 10 * s.old

# i'll now plot the log likelihood functions with these two data sets
# on the same graph, but to make them comparable i'll force them
# to go through the same point ( theta.hat.mle, 0 )

plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ), type = 'l',
  xlab = 'theta', ylab = 'Log Likelihood', lwd = 3, col = 'darkcyan',
  main = 'Bernoulli sampling model ( s = 830, n = 921 )' )

abline( v = theta.hat.mle, lwd = 2, lty = 2, col = 'red' )

# at the moment this version of the log likelihood function
# goes approximately through the point ( 830 / 921, - 295 )

# remember that the idea of the likelihood function includes
# a degree of freedom: to get *A* likelihood function,
# we can multiply the joint sampling distribution by any
# positive constant c that suits us

# this means, since log converts multiplication into addition,
# that we can *ADD* any constant we like to the log likelihood function
# (shifting it up or down vertically)

# to make a version of the log likelihood function that goes through
# the point ( theta.hat.mle, 0 ), i need to know the maximum value
# attained by the log likelihood function

# but this is easy to compute: it has to be the log likelihood function
# evaluated at the MLE:

print( original.data.maximum.log.likelihood.value <- bernoulli.log.likelihood( theta.hat.mle, s,
n ) )

# [1] -296.9771

# so now i can create the first part of the plot that i want to make:

plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ) +
  abs( original.data.maximum.log.likelihood.value ), type = 'l', xlab = 'theta',
  ylab = 'Vertically Shifted Log Likelihood', lwd = 3, col = 'darkcyan',
  main = 'Bernoulli sampling model' )

abline( v = theta.hat.mle, lwd = 3, lty = 3, col = 'red' )

abline( h = 0, lty = 1, lwd = 2, col = 'black' )

# now let's do the same thing with the new data set,
# in which s and n have both been multiplied by 10
# (since the MLE is s / n, this of course leaves the MLE unchanged)

print( new.data.maximum.log.likelihood.value <- bernoulli.log.likelihood( theta.hat.mle, 10 * s, 10 * n ) )

# [1] -2969.771

# question: how does this relate in this case study to
# the previous maximum log likelihood value? why?

# answer: it equals 10 times the old value, because the log likelihood
# function in this case study is additive in s and ( n - s )

# question: what should the new plot look like, with both curves
# superimposed?

# answer: the new log likelihood function will approach the point
# ( theta.hat.mle, 0 ) much more steeply, both from the left
# and from the right

lines( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, 10 * s, 10 * n ) +
  abs( new.data.maximum.log.likelihood.value ), type = 'l', lwd = 3, lty = 2,
  col = 'blue' )

legend( 0.85, -12.5, legend = c( 'n = 921', 'n = 9210', 'MLE' ),
  col = c( 'darkcyan', 'blue', 'red' ), lty = c( 1, 2, 3 ) )

# question: how are the darkcyan and blue functions similar?
# how are they different?
# answers: (similar) same basic shape (concave on the log likelihood scale); # they pass through the same point ( theta.hat.mle, 0 ); they have the same # first partial derivative with respect to theta at that point; # (different) they have sharply different values of # - the second partial derivative of the log likelihood function # at the MLE # let's see how this idea plays out on the likelihood scale # to get likelihood values, i just have to exponentiate # the log likelihood values in the most recent plot par( mfrow = c( 2, 1 ) ) plot( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, s, n ) + abs( original.data.maximum.log.likelihood.value ), type = 'l', xlab = 'theta', ylab = 'Vertically Shifted Log Likelihood', lwd = 3, col = 'darkcyan', main = 'Bernoulli sampling model' ) abline( v = theta.hat.mle, lwd = 3, lty = 3, col = 'red' ) abline( h = 0, lty = 1, lwd = 2, col = 'black' ) lines( theta.grid.3, bernoulli.log.likelihood( theta.grid.3, 10 * s, 10 * n ) + abs( new.data.maximum.log.likelihood.value ), type = 'l', lwd = 3, lty = 2, col = 'blue' ) plot( theta.grid.3, exp( bernoulli.log.likelihood( theta.grid.3, s, n ) + abs( original.data.maximum.log.likelihood.value ) ), type = 'l', xlab = 'theta', ylab = 'Vertically Scaled Likelihood', lwd = 3, col = 'darkcyan', main = 'Bernoulli sampling model' ) abline( v = theta.hat.mle, lwd = 3, lty = 3, col = 'red' ) lines( theta.grid.3, exp( bernoulli.log.likelihood( theta.grid.3, 10 * s, 10 * n ) + abs( new.data.maximum.log.likelihood.value ) ), type = 'l', lwd = 3, lty = 2, col = 'blue' ) par( mfrow = c( 1, 1 ) ) # ok, so what have we learned? 
# as n increases, holding the MLE constant,
# minus the second partial derivative of the log likelihood function
# with respect to theta, evaluated at the MLE, goes *UP*;
# our inferential *information* about theta has gone *UP*; and therefore
# our inferential *uncertainty* about theta, as measured by the
# repeated-sampling variance of the (random-variable version of the) MLE,
# goes *DOWN*

# by a different argument (involving a taylor expansion of the
# log likelihood function) mr. fisher drew the same conclusion
# in considerably more generality than in our examination of
# the quiz 2 case study: he defined

# the *information* I ( theta.hat.mle ) = minus the second partial derivative
# of the log likelihood function with respect to theta, evaluated at
# theta = theta.hat.mle

# for a look at the taylor-series story, you could study
# the reply by Deep North at stackexchange:

# https://stats.stackexchange.com/questions/174600/help-with-taylor-expansion-of-log-likelihood-function

# mr.
fisher had a huge ego, so he wanted people to call this quantity
# *fisher information*; he knew that by academic convention
# you personally can't name one of your ideas after yourself,
# but he didn't have long to wait before people started calling it
# fisher information

# let's compute the information symbolically here, with wolfram alpha:

# second derivative of - ( s * log( theta ) + ( n - s ) * log( 1 - theta ) ) with respect to theta

# ( n - s ) / ( 1 - theta )^2 + s / theta^2

# evaluate ( n - s ) / ( 1 - theta )^2 + s / theta^2 at theta = s / n

# n^3 / ( n * s - s^2 )

# we want this expression in terms of the MLE theta.hat.mle = s / n

# divide both numerator and denominator by n^2 to get

# n / ( ( s / n ) - ( s / n )^2 ) =
# n / ( theta.hat.mle - theta.hat.mle^2 ) =
# n / ( theta.hat.mle * ( 1 - theta.hat.mle ) )

# this is the information about theta in this sampling model:

# I ( theta.hat.mle ) = n / ( theta.hat.mle * ( 1 - theta.hat.mle ) )

# question: what happens to the information as the sample size increases?
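# before stating the answer, here's a numerical check (a sketch added
# for illustration): a central finite-difference approximation to minus
# the second derivative of the log likelihood at the MLE, for the quiz-2
# data and for the 10-times-larger data set; all object names below are
# new, and the data and log likelihood definitions simply repeat the
# ones above so that this chunk is self-contained

```r
# numerically approximate the information and compare it with the
# symbolic answer n / ( theta.hat.mle * ( 1 - theta.hat.mle ) )

n <- 921
s <- 830
theta.hat.mle <- s / n

bernoulli.log.likelihood <- function( theta, s, n ) {
  return( s * log( theta ) + ( n - s ) * log( 1 - theta ) )
}

# central finite-difference approximation to minus the second
# derivative of the log likelihood, evaluated at the MLE s / n

observed.information <- function( s, n, h = 1e-5 ) {
  theta.hat <- s / n
  return( - ( bernoulli.log.likelihood( theta.hat + h, s, n ) -
    2 * bernoulli.log.likelihood( theta.hat, s, n ) +
    bernoulli.log.likelihood( theta.hat - h, s, n ) ) / h^2 )
}

print( information.n <- observed.information( s, n ) )
print( information.10n <- observed.information( 10 * s, 10 * n ) )

# the second value should be close to 10 times the first, and the first
# should agree closely with the symbolic answer:

print( n / ( theta.hat.mle * ( 1 - theta.hat.mle ) ) )
```

# the factor-of-10 relationship between the two numerical values
# previews the answer to the question just asked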
# answer: it goes up, at a rate proportional to n

# in fact, and this is a rather general result under regularity conditions,
# with an IID sample of size n from a parametric family of sampling
# distributions,

# I ( theta.hat.mle ) = O ( n )

# this suggests that

# I ( theta.hat.mle based on an IID sample of size n ) =
# n * I ( theta.hat.mle based on an IID sample of size 1 ) ,

# which is also true rather generally

# sidebar begins below this line

# the main regularity condition needed for the last two results to hold
# is that

# the support of the IID random variables Y_i does not depend
# in any way on the unknown parameter

# in my versions of stat 131, i sometimes set a problem on
# one of the take-home tests in which the sampling model is

# ( Y_i | theta, B ) ~IID Uniform( 0, theta )

# this violates the regularity condition mentioned above,
# and (remarkably) the new result for this model is

# I ( theta.hat.mle ) = O ( n^2 )

# sidebar ends above this line

# returning to the result in the IID Bernoulli sampling model,

# I ( theta.hat.mle ) = n / ( theta.hat.mle * ( 1 - theta.hat.mle ) ) ,

# mr. fisher wants an estimated repeated-sampling variance
# for theta.hat.mle; but

# the estimated variance of theta.hat.mle should go *down* as n increases,
# whereas we've just seen that the fisher information goes *up*
# as n grows; in other words,

# the estimated variance of theta.hat.mle and I ( theta.hat.mle )
# have an inverse relationship: as one goes down, the other one goes up

# question: what's the simplest form of inverse relationship?
# answer: y = 1 / x

# now look back at the right answer for the
# estimated variance of theta.hat.mle in this case study,
# from the Bernoulli or Binomial sampling distributions ((*) above)

# \hat{ V }_{ IID } ( theta.hat.mle ) =
# theta.hat.mle * ( 1 - theta.hat.mle ) / n

# from above,

# I ( theta.hat.mle ) = n / ( theta.hat.mle * ( 1 - theta.hat.mle ) )

# this strongly suggests the conjecture

# estimated variance of theta.hat.mle = [ I ( theta.hat.mle ) ]^( -1 ) ,

# which again turns out to be true rather generally

# this means that mr. fisher has offered us a quite general recipe,
# not only for good point estimates but also for good uncertainty bands
# around those estimates:

# \hat{ V } ( theta.hat.mle ) = [ I ( theta.hat.mle ) ]^( -1 ) ,

# leading immediately to

# SE.hat ( theta.hat.mle ) = sqrt{ [ I ( theta.hat.mle ) ]^( -1 ) }

# mr. fisher also conjectured in his 1922 paper that MLEs were
# the best possible estimators in a given sampling model,
# which turns out to be essentially correct

# sidebar begins below this line

# there's a later (around 1945) math result called the
# *cramer-rao lower bound* that proves that mr. fisher's conjecture
# was correct in a sense that's compelling from the frequentist
# point of view:

# theorem (cramer-rao, informal statement): no unbiased estimator
# of an unknown parameter can have a repeated-sampling variance
# smaller than the inverse of the fisher information, and the MLE's
# approximate variance attains this lower bound

# (as usual, the name of this result is partially wrong;
# at almost the exact same moment, the following people
# (in alphabetical order) all published papers (working independently)
# with this same result:

# aitken, alexander
# cramer, harald
# darmois, georges
# frechet, maurice
# rao, cr
# silverstone, harold )

# sidebar ends above this line

# it turns out that MLEs are not in general unbiased, but
# they're approximately unbiased, and the bias decreases quickly
# as n increases:

# E_{ IID } ( theta.hat.mle ) = theta + O ( 1 / n )

# and finally, mr.
fisher also has a central limit theorem for his
# maximum likelihood estimators:

# for large n the repeated-sampling distribution of theta.hat.mle
# is approximately Normal with mean theta and standard error

# SE.hat ( theta.hat.mle ) = sqrt{ [ I ( theta.hat.mle ) ]^( -1 ) }

# because of mr. fisher's antipathy toward mr. neyman, mr. fisher
# was reluctant to commit to mr. neyman's confidence interval (CI) idea
# in print, but practitioners quickly combined the two approaches
# and routinely computed maximum-likelihood-based CIs, which is
# a perfectly valid thing to do:

# an approximate (large-n) 100 ( 1 - alpha )% confidence interval for theta
# from mr. fisher's likelihood approach is

# theta.hat.mle +/-
# Phi^( -1 ) ( 1 - alpha / 2 ) * sqrt{ [ I ( theta.hat.mle ) ]^( -1 ) } ,

# in which

# I ( theta.hat.mle ) = minus the second partial derivative
# of the log likelihood function with respect to theta,
# evaluated at theta = theta.hat.mle

# this was an outstanding technology for the year 1922, and it's still
# used routinely today, in both statistical and machine-learning data science

# in the quiz 2 full-employment case study, the two different-looking
# approaches to inference about theta offered by mr. neyman and mr. fisher
# yield identical results

# the advantage of mr. fisher's method is that he has no uncertainty
# about the 'best' estimator to use for the unknown parameter theta,
# namely the MLE; mr.
neyman preferred to focus on *unbiased* estimators, # which don't always exist and which sometimes produce silly # (logically-internally-inconsistent) results # sidebar begins below this line # i can show you a (slightly complicated) sampling model in which # we're interested in estimating a variance sigma^2, # which of course must be non-negative; in this model the unbiased # estimate of sigma^2 can come out negative # in general unbiasedness is a reasonable criterion when the # *parameter space* (the set of possible values of an unknown parameter) # is unbounded; if bounds exist (e.g., sigma^2 >= 0 above), # unbiased estimators can sometimes disobey the bound(s) # MLEs and bayesian estimators *always* correctly respect # the boundaries of the parameter space # sidebar ends above this line # covered up through here on 3 feb 2021 ################################################################################# ##### stress test: could anything go wrong with the story we developed above? # we saw that with ( s, n ) = ( 830, 921 ), the likelihood values # on the vertical axis were quite small: on the order of 10^( -130 ) # let's try a stress test: multiply both n and s by 10 # and try plotting the likelihood function n.stress.test <- 9210 s.stress.test <- 8300 plot( theta.grid.3, bernoulli.likelihood( theta.grid.3, s.stress.test, n.stress.test ), type = 'l', xlab = 'theta', ylab = 'Bernoulli Likelihood', lwd = 3, col = 'darkcyan' ) # question: what just happened? # answer: this is numerical underflow at work: R can't handle numbers # closer to 0 than x; what's x? .Machine # . # . # $double.xmin # [1] 2.225074e-308 # . # . 
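# we can watch the underflow happen directly (a sketch added for
# illustration; these print calls are new, not part of the analysis above)

```r
# doubles can represent (normalized) numbers down to about 2e-308,
# so exp( -700 ) survives, but exp( -1000 ), whose true value is
# about 5e-435, is rounded all the way down to 0

print( exp( -700 ) )        # tiny, but still nonzero
print( exp( -1000 ) )       # underflows: R prints 0
print( exp( -1000 ) == 0 )  # TRUE
```

# this is exactly what happened to the likelihood values above:
# a log likelihood near -2970 exponentiates to an unrepresentable number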
# R is conforming to the IEEE 754 technical standard for
# floating point computation: anything of type 'numeric'
# closer to 0 than about 2 * 10^( - 308 ) is rounded down to 0
# (denormalized numbers stretch this a bit further, to about
# 5 * 10^( - 324 ), but that doesn't change the story here)

# it turns out that the vanilla versions of wolfram alpha,
# mathematica and maple cannot handle this either

# and the CRAN R packages 'Ryacas', 'Rmpfr', and 'gmp' (and the
# wonderfully-named package 'Brobdingnag') don't rescue us here either:

# even if you could get an R function to compute a number
# closer to 0 than about 2 * 10^( - 308 ), the 'plot' function
# would turn it into a 0 anyway

# we've already figured out earlier how to solve this problem:

# (1) compute the MLE theta.hat.mle = s / n

# (2) evaluate the log likelihood function at the MLE;
#     call this number ell ell ( theta.hat.mle, s, n );
#     it's the maximum value of the log likelihood function

# (3) add the absolute value of ell ell ( theta.hat.mle, s, n )
#     to the log likelihood function and exponentiate
#     to get a version of the likelihood function
#     that's *been scaled* so that it passes through the point
#     ( theta.hat.mle, 1 ), and plot that

print( theta.hat.mle.stress.test <- s.stress.test / n.stress.test )

# [1] 0.9011944

print( maximum.log.likelihood.value.stress.test <- bernoulli.log.likelihood( theta.hat.mle.stress.test, s.stress.test, n.stress.test ) )

# [1] -2969.771

plot( theta.grid.3, exp( bernoulli.log.likelihood( theta.grid.3, s.stress.test, n.stress.test ) +
  abs( maximum.log.likelihood.value.stress.test ) ), type = 'l', xlab = 'theta',
  ylab = 'Scaled Likelihood', lwd = 3, col = 'darkcyan',
  main = 'Bernoulli sampling model ( s = 8300, n = 9210 )' )

# let's zero in on the (narrower) interesting region
# with these values of s and n

theta.grid.4 <- seq( 0.885, 0.915, length = 500 )

plot( theta.grid.4, exp( bernoulli.log.likelihood( theta.grid.4, s.stress.test, n.stress.test ) +
  abs( maximum.log.likelihood.value.stress.test ) ), type = 'l', xlab = 'theta',
  ylab = 'Scaled Likelihood', lwd = 3, col = 'darkcyan', main = 'Bernoulli
sampling model ( s = 8300, n = 9210 )' )

# as noted above, and as we'll see soon in lecture,
# the bayesian interpretation of this plot is to regard it
# as a scaled version of an approximately Normal PDF for theta,
# which turns out to be a scaled version of the posterior distribution
# for theta with a Uniform( 0, 1 ) prior for theta (and we'll soon see
# that this is a *low-information* prior)

# so let's superimpose a scaled Normal PDF on the scaled likelihood function
# and compare

# question: which Normal distribution? (i.e., what should we use
# for the mean and SD)

# answer: we use theta.hat.mle as the mean, and we use
# the likelihood-based estimated SE (based on information) as the SD

print( se.hat.theta.hat.mle.stress.test <- sqrt( theta.hat.mle.stress.test *
  ( 1 - theta.hat.mle.stress.test ) / n.stress.test ) )

# [1] 0.003109355

# the usual vertical scale factor for the Normal( mu, sigma^2 ) distribution
# is (of course) 1 / ( sigma * sqrt( 2 * pi ) )

# so i'll multiply the usual Normal PDF by ( sigma * sqrt( 2 * pi ) )
# to get it to go through the point ( theta.hat.mle.stress.test, 1 )

lines( theta.grid.4, se.hat.theta.hat.mle.stress.test * sqrt( 2 * pi ) *
  dnorm( theta.grid.4, theta.hat.mle.stress.test, se.hat.theta.hat.mle.stress.test ),
  lty = 2, lwd = 3, col = 'blue' )

# this illustrates an important general result:

# when n is large and the amount of information about theta external
# to the data vector y = ( y_1, ..., y_n ) is small,

# maximum likelihood = approximate bayes

# this is an informal statement of a powerful result
# known as the *bernstein-von mises theorem*

# the bayesian inferential story is much easier to interpret than
# the frequentist inferential story: with bayes we can make
# direct probability statements about the unknown theta,
# whereas with frequency we cannot

# example: in the quiz 2 full-employment case study, mr. neyman's
# 99.9% confidence interval was ( 86.9, 93.4 )%, and
# mr.
fisher's likelihood inference gets exactly the same thing, but

# P_F ( 0.869 < theta < 0.934 ) is not equal to 0.999; it's undefined

# whereas, if we have little information about theta external to
# the quiz-2 data set,

# P_B ( 0.869 < theta < 0.934 | y, low-information prior, B )

# *IS* approximately equal to 0.999

# and (as we'll soon see) the bayesian story requires computing or
# approximating nasty integrals, so that

# maximum likelihood, which is based on *differentiation*,
# is a lot easier to implement than bayes, which is based on *integration*

# this is why mr. fisher's technology was so perfect for 1922,
# and for many decades thereafter: even if someone wanted to
# get bayesian results in 1922, it's quite possible that they
# didn't know how to accurately approximate the nasty integrals
# (and this remained true up to about 1980)

# so we can eat our cake and have it too:

# when n is large and the amount of information about theta external
# to the data vector y = ( y_1, ..., y_n ) is small,
# we can make our calculations using maximum likelihood
# and interpret the results in a bayesian way

#################################################################################

##### functional invariance of the MLE

# here's another important feature of mr. fisher's MLE algorithm:

# suppose that you've found the MLE theta.hat.mle for a
# (one-dimensional) parameter theta

# now you get interested in a new quantity

# eta = g ( theta ) ,

# where g: R -> R is a 'nice' function; here 'nice' means *invertible*

# eta = g ( theta ) is called a *transformed version* of theta;
# since g( . ) is invertible, eta and theta carry exactly the same
# information, so going from theta to eta is also called
# *reparameterizing* the sampling model

# it would certainly be great if the MLE of g ( theta ) was just
# g( .
# ) evaluated at the MLE of theta, i.e., if

# eta.hat.mle = \hat{ g ( theta ) } = g ( \hat{ theta } ) = g( theta.hat.mle )

# and this turns out to be true; it's called the
# *functional invariance property* of maximum likelihood

# example in the quiz 2 case study: theta is a proportion or probability
# in favor of the truth of a true/false proposition A, satisfying 0 < theta < 1

# later in the course we'll see that from a modeling perspective
# a more (technically) natural parameter to be interested in,
# in this situation, is

# eta = g ( theta ) = log ( theta / ( 1 - theta ) ) =
#   the log odds in favor of A being true

# this is also called the *logit* transformation:

# logit( theta ) = log ( theta / ( 1 - theta ) )

# let's look at this function

logit <- function( theta ) {

  return( log( theta / ( 1 - theta ) ) )

}

# this function is properly vectorized

# note that logit( theta ) goes to ( - infinity ) as theta decreases to 0
# and goes to + infinity as theta increases to 1

# so let's avoid 0 and 1

theta.grid.4 <- seq( 0.0001, 0.9999, length = 500 )

plot( theta.grid.4, logit( theta.grid.4 ), type = 'l', lty = 1, lwd = 3,
  col = 'darkcyan', xlab = 'theta', ylab = 'logit ( theta )' )

# we could *reparameterize* the likelihood function in terms of eta
# and go through the maximization process, but we don't need to:
# g( theta ) is invertible, so

# the MLE of eta = eta.hat.mle = logit( theta.hat.mle )
#   = log( theta.hat.mle / ( 1 - theta.hat.mle ) )

# it's not hard to show that the inverse transformation (anti-logit)
# that undoes the logit is

# eta = log( theta / ( 1 - theta ) ) iff theta = 1 / ( 1 + exp ( - eta ) )

summary( logit( theta.grid.4 ) )

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  -9.210  -1.098   0.000   0.000   1.098   9.210

anti.logit <- function( eta ) {

  return( 1 / ( 1 + exp( - eta ) ) )

}

eta.grid.1 <- seq( min( logit( theta.grid.4 ) ), max( logit( theta.grid.4 ) ),
  length = 500 )

plot( eta.grid.1, anti.logit( eta.grid.1 ), type = 'l', lty = 1, lwd = 3,
  col = 'darkcyan', xlab = 'eta', ylab = 'anti-logit ( eta )' )

# this looks like a CDF, and in fact it *IS*; it's the CDF of the
# *(standard) logistic distribution*, with PDF

# Y ~ Logistic iff

# p_Y ( eta ) = exp( - eta ) / ( 1 + exp( - eta ) )^2
#             = sech^2 ( eta / 2 ) / 4

# (the last expression for fans of hyperbolic functions)

standard.logistic.pdf <- function( eta ) {

  pdf <- exp( - eta ) / ( 1 + exp( - eta ) )^2

  return( pdf )

}

plot( eta.grid.1, standard.logistic.pdf( eta.grid.1 ), type = 'l', lty = 1,
  lwd = 3, col = 'darkcyan', xlab = 'eta', ylab = 'Standard Logistic PDF' )

# this PDF looks rather like either a Normal or a t distribution

lines( eta.grid.1, dnorm( eta.grid.1 ), lty = 2, lwd = 3, col = 'red' )

# need to re-plot with a larger upper limit for the vertical axis:
# the mode of the standard normal distribution has height 1 / sqrt( 2 * pi )

# [1] 0.3989423

plot( eta.grid.1, standard.logistic.pdf( eta.grid.1 ), type = 'l', lty = 1,
  lwd = 3, col = 'darkcyan', xlab = 'eta', ylab = 'PDF', ylim = c( 0, 0.4 ) )

# this PDF looks rather like either a Normal or a t distribution
# (the *Cauchy* distribution is the same as the t distribution
# with 1 degree of freedom)

lines( eta.grid.1, dnorm( eta.grid.1 ), lty = 2, lwd = 3, col = 'red' )

lines( eta.grid.1, dt( eta.grid.1, 1 ), lty = 3, lwd = 3, col = 'blue' )

legend( 4.0, 0.35, legend = c( 'Logistic', 'Normal', 'Cauchy' ),
  col = c( 'darkcyan', 'red', 'blue' ), lty = 1:3, lwd = 3 )

#################################################################################

##### sufficiency; sufficient statistics

# one more important idea in the likelihood story
# that's also important in the bayesian story

# look at the form of the
# likelihood function in the quiz 2 case study:

# ell ( theta | y, B ) = c * theta^s * ( 1 - theta )^( n - s )

# notice that y appears on the left-hand side of this expression
# but not on the right

# question: why not?

# answer: we initially took y and computed its sum s;
# we could equivalently write

# ell ( theta | y, B ) = ell( theta | s, n, B ) = ell( theta | s, B )
#   = c * theta^s * ( 1 - theta )^( n - s )

# this situation is interesting: somebody offers us the entire data vector
# y = ( y_1, ..., y_n ), and to evaluate the likelihood function we
# perform a dimensionality reduction on the data set *down from n to 1*
# (not counting n itself; by convention the sample size is not counted
# in this list of summaries of y) by calculating the sum s of the y_i values;
# with s in hand, we no longer need the original y vector to evaluate
# the likelihood function

# in his 1922 paper mr. fisher called this property *sufficiency*;
# he would say that *s is sufficient* (or *s is a sufficient statistic*)
# for drawing inferences about theta in the
# IID Bernoulli( theta ) sampling model

# as noted above, this idea is equally important to bayesians

# there's a frequentist definition of sufficiency and a different, simpler
# bayesian definition, which is the one we'll use in this class:

# (bayesian definition of sufficiency) a function ss( y ) of the data vector
# is *sufficient* for the unknown parameter theta iff the likelihood function
# depends on y only through ss( y )

# question: if a sampling model has a sufficient statistic, is it unique?
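# before tackling that question, here's a quick numerical check of the
# definition above; this is a sketch, not part of the quiz 2 analysis:
# the y values below are simulated (not the real quiz 2 responses), and
# the helper names likelihood.from.y and likelihood.from.s are made up
# for illustration; the point is that the likelihood computed from the
# full data vector y matches the likelihood computed from s alone (c = 1)

```r
set.seed( 1 )

y.sim <- rbinom( 20, 1, 0.9 )   # 20 simulated Bernoulli( 0.9 ) draws

s.sim <- sum( y.sim )
n.sim <- length( y.sim )

# likelihood built from the entire data vector y
# (product of n individual bernoulli terms)

likelihood.from.y <- function( theta, y ) {

  return( prod( theta^y * ( 1 - theta )^( 1 - y ) ) )

}

# likelihood built from the summary ss( y ) = s alone, with c = 1

likelihood.from.s <- function( theta, s, n ) {

  return( theta^s * ( 1 - theta )^( n - s ) )

}

theta.check <- 0.85

print( likelihood.from.y( theta.check, y.sim ) )
print( likelihood.from.s( theta.check, s.sim, n.sim ) )

# the two printed values agree (up to floating-point rounding),
# confirming that the likelihood depends on y only through s
```

# the same check works for any theta in ( 0, 1 ) and any 0/1 vector y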
# answer: NO; for example,

# ss.1 ( y ) = sum of ( y_1, ..., y_n )            (1-dimensional) (best)
# ss.2 ( y ) = ( number of 0s, number of 1s )      (2-dimensional)
# ss.3 ( y ) = ( y_1, sum of ( y_2, ..., y_n ) )   (2-dimensional)
# ss.4 ( y ) = y itself                            (n-dimensional)

# actually, there are always lots of sufficient statistics; for example,
# according to the definition, the data vector y is itself sufficient

# in the quiz 2 context, the 2-dimensional summary

# ss.3 = ( y_1, sum of ( y_2, ..., y_n ) )

# is also sufficient

# from a data-dimensionality-reduction point of view, ss.1 = s (reduction
# from n to 1) is better than ss.2 or ss.3 (reduction from n to 2)

# mr. fisher understood this; he asked himself: is there a *minimal*
# level of data-dimensionality-reduction below which we can't go
# and still preserve sufficiency?

# he made another definition to help explore the answer to this question;
# again, there's a formal frequentist definition, but we'll work with
# an informal version:

# (informal definition of *minimal sufficient statistic*) a sufficient
# statistic ss is *minimal sufficient* if, starting with ss,
# no further data-dimensionality reduction is possible while
# preserving sufficiency

# here ss.1 = sum of ( y_1, ..., y_n ) is both sufficient and minimal sufficient

# in the quiz 2 setting, note that (a) the vector theta of unknowns is
# a scalar (a vector of length 1), so that the number of unknown things
# is 1, and (b) the dimensionality of ( s = the sum of the y_i values )
# is also 1

# this is the usual (regular) case:

# if you have k unknowns in your vector theta = ( theta_1, ..., theta_k ),
# for k a finite integer >= 1, and if a minimal sufficient statistic exists,
# it will usually also be of dimension k

# an example that we'll look at soon:

# ( Y_i | mu, sigma, B ) ~IID N ( mu, sigma^2 ) with mu and sigma unknown,
# ( i = 1, ..., n )

# Y = ( Y_1, ..., Y_n ) and y = ( y_1, ..., y_n ) (both of dimension n)

# now

# theta = ( mu, sigma ) or theta = ( mu, sigma^2 )
# (dimension k = 2)

# Theta = ( - infinity, + infinity ) x ( 0, + infinity )

# all minimal sufficient statistics in this model will have dimension k = 2

# here are two minimal sufficient statistics in this model:

# ss.1 ( y ) = ( sum of y, sum of y^2 )
# ss.2 ( y ) = ( sample mean of y, sample variance of y )

# and n is always implied as another summary of y that's necessary
# to evaluate the likelihood function

#################################################################################
#################################################################################
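# postscript: a quick numerical check of the functional invariance
# property with the quiz 2 data; this is a sketch under two assumptions
# not used elsewhere in this file: it maximizes over eta numerically with
# R's optimize( ) rather than by calculus, and it works on the *log*
# likelihood scale to dodge the underflow issue mentioned earlier;
# the *.check names are made up so this chunk is self-contained

```r
n.check <- 921
s.check <- 830

logit.check      <- function( theta ) { log( theta / ( 1 - theta ) ) }
anti.logit.check <- function( eta )   { 1 / ( 1 + exp( - eta ) ) }

# bernoulli log likelihood, reparameterized in eta = logit( theta )

bernoulli.log.likelihood.eta <- function( eta, s, n ) {

  theta <- anti.logit.check( eta )

  return( s * log( theta ) + ( n - s ) * log( 1 - theta ) )

}

# maximize directly over eta on a generous interval

eta.hat.numerical <- optimize( bernoulli.log.likelihood.eta,
  interval = c( -10, 10 ), s = s.check, n = n.check,
  maximum = TRUE )$maximum

# functional invariance predicts eta.hat.mle = logit( theta.hat.mle )
# = logit( s / n )

print( eta.hat.numerical )
print( logit.check( s.check / n.check ) )

# the two printed values agree (up to optimize( )'s numerical tolerance),
# as the functional invariance property says they should
```

# the same check could be run with any other invertible g( . )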