---
title: "Prior and Bayse's rule"
author: "Laurent Modolo"
date: "`r Sys.Date()`"
output:
  beamer_presentation:
    df_print: tibble
    fig_caption: no
    highlight: tango
    latex_engine: xelatex
    slide_level: 1
    theme: metropolis
  ioslides_presentation:
    highlight: tango
  slidy_presentation:
    highlight: tango
classoption: aspectratio=169
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
# Load packages
library(bayesrules)
library(tidyverse)
library(janitor)

# Import article data
data(fake_news)
```

# Bayes Rules

\begin{center}
\begin{columns}
\column{0.5\textwidth}

\includegraphics[width=\textwidth]{img/book_cover.jpeg}

\column{0.5\textwidth}

\href{https://www.bayesrulesbook.com/}{https://www.bayesrulesbook.com/}

\end{columns}
\end{center}

# Prior probability model

In 1996, Gary Kasparov won three, drew two, and lost one game against the IBM supercomputer Deep Blue.
Kasparov and Deep Blue were to meet again in 1997. Let $\pi$ denote Kasparov’s chances of winning any particular game in the re-match.

Given the complexity of chess, machines, and humans, $\pi$ is unknown and can vary or fluctuate over time. Or, in short, $\pi$ is a **random variable**.

To analyse $\pi$ we will start with a prior model, which

1. identifies the values $\pi$ can take
2. assigns a prior weight or probability to each value
3. ensures these probabilities sum to 1

\begin{center}
\begin{tabular}{ c c c c c }
 $\pi$ & 0.2 & 0.5 & 0.8 & Total \\ 
 $f(\pi)$ & 0.10 & 0.25 & 0.65 & 1 
\end{tabular}
\end{center}
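
A minimal sketch of this discrete prior in R (object and column names are illustrative, not taken from the book's code):

```{r chess-prior, echo=TRUE}
# Discrete prior for Kasparov's win probability (values from the table above)
chess_prior <- tibble(
  pi = c(0.2, 0.5, 0.8),
  f_pi = c(0.10, 0.25, 0.65)
)
sum(chess_prior$f_pi)  # the prior probabilities sum to 1
```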

# Kasparov’s skill level relative to that of Deep Blue

Data $Y$ is the number of the six games in the 1997 re-match that Kasparov wins. Since the chess match outcome isn’t predetermined, $Y$ is a **random variable** that can take any value in $\{0,1,\dots, 6\}$.

$Y$ inherently depends upon Kasparov’s win probability $\pi$.

- If $\pi = 0.8$  Kasparov’s victories $Y$ will tend to be high
- If $\pi = 0.2$  Kasparov’s victories $Y$ will tend to be low

$Y$ *depends upon* or is *conditioned upon* the value of $\pi$

$$f(y|\pi)=P(Y=y|\pi)$$

# Binomial model

1. the outcome of any one game doesn’t influence the outcome of another
2. Kasparov has an equal probability, $\pi$, of winning any game in the match

This is a common framework in statistical analysis, one which can be represented by the Binomial model.

$$Y|\pi \sim Bin(6,\pi)$$

In the end, Kasparov only won one of the six games against Deep Blue in 1997.

# Binomial model

The **pmf** of a $Bin(6, \pi)$ model is plotted for each possible value of $\pi \in \{0.2,0.5,0.8\}$. The masses marked by the black lines correspond to the eventual observed data, $Y=1$ win.

![Binom chess](img/binom-chess-1.png)
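
In R these masses come from `dbinom()`; a sketch for one candidate value, $\pi = 0.8$ (illustrative, not the book's code):

```{r chess-pmf, echo=TRUE}
# pmf of Y | pi ~ Bin(6, 0.8): probability of y = 0, 1, ..., 6 wins
dbinom(0:6, size = 6, prob = 0.8)
```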

# Binomial likelihood function

The likelihood function $L(\pi|y=1)$ of observing $Y=1$ win in six games for any win probability $\pi \in \{0.2,0.5,0.8\}$. 

\begin{center}
\includegraphics[width=0.6\textwidth]{img/binom-chess-like-1.png}
\end{center}
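
The likelihood is the same `dbinom()` computation read the other way: $y=1$ is fixed and $\pi$ varies. A sketch, continuing the illustrative objects above:

```{r chess-likelihood, echo=TRUE}
# Likelihood of y = 1 win under each candidate pi
chess_likelihood <- dbinom(1, size = 6, prob = chess_prior$pi)
chess_likelihood
```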

# Probability mass functions vs likelihood functions

When $\pi$ is known, the conditional **pmf** $f(⋅|\pi)$ allows us to compare the probabilities of different possible values of data $Y$ occurring with $\pi$:

$$f(y_1|\pi) \text{ vs } f(y_2|\pi)$$

When $Y=y$ is known, the **likelihood function** $L(⋅|y)=f(y|⋅)$ allows us to compare the relative likelihood of observing data $y$ under different possible values of $\pi$:

$$ L(\pi_1|y) \text{ vs } L(\pi_2|y)$$

Thus, $L(⋅|y)$ provides the tool we need to evaluate the relative compatibility of data $Y=y$ with various $\pi$ values.


# Computing the posterior probability of $\pi$ given $y$

We want to balance this prior and likelihood information.

$$f(\pi | y) \propto f(\pi)L(\pi|y)$$

$L(\pi|y)$ is not a probability, so we need a **normalizing constant**.

# Bayes' rule

Bayes’ rule requires three pieces of information: the **prior**, the **likelihood**, and a **normalizing constant**.


$$f(\pi | y) = \frac{f(\pi)L(\pi|y)}{\text{normalizing constant}} = \frac{P(\pi \cap y)}{\text{normalizing constant}}$$

$P(\pi \cap y)$ is the joint probability of observing both $\pi$ and $y$.

$$\text{posterior} = \frac{\text{prior} ⋅ \text{likelihood}}{\text{normalizing constant}}$$

# Normalizing constant

We must determine the **total probability** that Kasparov would win $Y=1$ game across all possible win probabilities $\pi$, $f(y=1)$.

$$f(y = 1) = \sum_{\pi \in \{0.2,0.5,0.8\}} L(\pi | y=1) f(\pi)$$
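
A sketch of this sum, continuing the illustrative `chess_prior` and `chess_likelihood` objects defined above:

```{r chess-normalizing, echo=TRUE}
# Total probability of Y = 1, averaging the likelihood over the prior
normalizing_constant <- sum(chess_likelihood * chess_prior$f_pi)
normalizing_constant
```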

# Posterior probability model

$$f(\pi | y=1) = \frac{f(\pi)L(\pi|y=1)}{f(y = 1)} \;\; \text{ for } \pi \in \{0.2,0.5,0.8\}$$

![prior_likelihood_posterior_chess](img/chesssummary-1.png)
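
Putting the pieces together in R (same illustrative objects as above):

```{r chess-posterior, echo=TRUE}
# Posterior: prior times likelihood, divided by the normalizing constant
chess_prior %>%
  mutate(
    likelihood = chess_likelihood,
    posterior = f_pi * likelihood / normalizing_constant
  )
```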

# Proportionality

$$f(\pi | y) = \frac{f(\pi)L(\pi|y)}{f(y)} \propto f(\pi)L(\pi|y)$$

![ch2scaled-1.png](img/ch2scaled-1.png)

# Beta-Binomial model

In the previous example our prior was an over-simplification:

- the possible values of $\pi$ are continuous on $[0,1]$
- we can use the $Beta$ distribution with parameters $\alpha$ and $\beta$ to model our prior

**Alison Bechdel’s 1985 comic rule:**
only see a movie if it satisfies the following three criteria (Bechdel 1986):

- the movie has to have at least two women in it
- these two women talk to each other; and
- they talk about something besides a man.

What percentage of all recent movies do you think pass the Bechdel test? Is it closer to 10%, 50%, 80%, or 100%?

# Three prior models for the proportion of films that pass the Bechdel test.

![ch4-bechdel-priors-1.png](img/ch4-bechdel-priors-1.png)

$$Y | \pi \sim \text{Bin}(n, \pi)$$
$$\pi \sim \text{Beta}(\alpha, \beta)$$
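
The `bayesrules` package (loaded in the setup chunk) provides a helper to visualize a Beta prior; a sketch with illustrative parameter values, not necessarily those of the three priors in the figure:

```{r beta-prior-sketch, echo=TRUE, fig.height=2.5}
# One candidate Beta prior for the Bechdel pass rate pi
plot_beta(alpha = 5, beta = 11)
```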

# Different priors, different posteriors

We sample $n=20$ movies, of which $Y=9$ (45%) pass the Bechdel test.

![unnamed-chunk-134-1.png](img/unnamed-chunk-134-1.png)

Likelihood functions are not pdfs; the likelihood is scaled here only to simplify the visual comparison between the prior and the data’s evidence about $\pi$.

# Different priors, different posteriors

We sample $n=20$ movies, of which $Y=9$ (45%) pass the Bechdel test.

![bechdel-post-ch4-1.png](img/bechdel-post-ch4-1.png)

Likelihood functions are not pdfs; the likelihood is scaled here only to simplify the visual comparison between the prior and the data’s evidence about $\pi$.
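
A sketch of how one of these prior-to-posterior updates can be reproduced with `bayesrules` helpers; the Beta(14, 1) prior is one illustrative choice, not necessarily the one plotted above:

```{r bechdel-update, echo=TRUE, fig.height=2.5}
# Prior, scaled likelihood, and posterior for y = 9 passes out of n = 20 movies
plot_beta_binomial(alpha = 14, beta = 1, y = 9, n = 20)
summarize_beta_binomial(alpha = 14, beta = 1, y = 9, n = 20)
```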

# Different data, different posteriors

- $n=13$ movies from the year 1991, among which $Y=6$ (about 46%) pass the Bechdel test
- $n=63$ movies from 2000, among which $Y=29$ (about 46%) pass the Bechdel test
- $n=99$ movies from 2013, among which $Y=46$ (about 46%) pass the Bechdel test

![unnamed-chunk-138-1.png](img/unnamed-chunk-138-1.png)

# Different data, different posteriors

- $n=13$ movies from the year 1991, among which $Y=6$ (about 46%) pass the Bechdel test
- $n=63$ movies from 2000, among which $Y=29$ (about 46%) pass the Bechdel test
- $n=99$ movies from 2013, among which $Y=46$ (about 46%) pass the Bechdel test

![bechdel-data-ch4-1.png](img/bechdel-data-ch4-1.png)
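
Because the Beta prior is conjugate to the Binomial likelihood, the posterior is $\text{Beta}(\alpha + y,\ \beta + n - y)$; a sketch comparing the three samples under one illustrative common $\text{Beta}(2, 2)$ prior:

```{r bechdel-data-sketch, echo=TRUE}
# Conjugate update: posterior parameters grow with the amount of data
tibble(
  year = c(1991, 2000, 2013),
  y = c(6, 29, 46),
  n = c(13, 63, 99)
) %>%
  mutate(post_alpha = 2 + y, post_beta = 2 + n - y)
```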

# Different priors, different data, different posteriors

\begin{center}
\includegraphics[width=\textwidth]{img/bechdel-combined-ch4-1.png}
\end{center}


# Bayesian knowledge building

\begin{center}
\includegraphics[width=0.6\textwidth]{img/bayes_diagram.png}

 A Bayesian knowledge-building diagram
\end{center}