---
title: "Prior and Bayes' rule"
author: "Laurent Modolo"
date: "`r Sys.Date()`"
output:
  beamer_presentation:
    df_print: tibble
    fig_caption: no
    highlight: tango
    latex_engine: xelatex
    slide_level: 1
    theme: metropolis
  ioslides_presentation:
    highlight: tango
  slidy_presentation:
    highlight: tango
classoption: aspectratio=169
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
# Load packages
library(bayesrules)
library(tidyverse)
library(janitor)
# Import article data
data(fake_news)
```
# Bayes Rules
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\includegraphics[width=\textwidth]{img/book_cover.jpeg}
\column{0.5\textwidth}
\href{https://www.bayesrulesbook.com/}{https://www.bayesrulesbook.com/}
\end{columns}
\end{center}
# Prior probability model
In 1996, Garry Kasparov won three, drew two, and lost one game against the IBM supercomputer Deep Blue.
Kasparov and Deep Blue were to meet again in 1997. Let $\pi$ denote Kasparov’s chances of winning any particular game in the re-match.
Given the complexity of chess, machines, and humans, $\pi$ is unknown and can vary or fluctuate over time. Or, in short, $\pi$ is a **random variable**.
To analyse $\pi$ we will start with a prior model which

1. identifies what values $\pi$ can take,
2. assigns a prior weight or probability to each,
3. ensures these probabilities sum to 1.
\begin{center}
\begin{tabular}{ c c c c c }
$\pi$ & 0.2 & 0.5 & 0.8 & Total \\
$f(\pi)$ & 0.10 & 0.25 & 0.65 & 1
\end{tabular}
\end{center}
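
As a minimal sketch, this discrete prior can be written down with the tidyverse tools loaded in the setup chunk (the object name `chess_prior` is illustrative, not taken from the book's code):

```{r chess_prior, echo=TRUE}
# Discrete prior model for Kasparov's win probability pi
chess_prior <- tibble(
  pi   = c(0.2, 0.5, 0.8),
  f_pi = c(0.10, 0.25, 0.65)
)
# A valid prior model: the prior probabilities sum to 1
sum(chess_prior$f_pi)
```
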
# Kasparov’s skill level relative to that of Deep Blue
Data $Y$ is the number of the six games in the 1997 re-match that Kasparov wins. Since the chess match outcome isn’t predetermined, $Y$ is a **random variable** that can take any value in $\{0,1,\dots, 6\}$.
$Y$ inherently depends upon Kasparov’s win probability $\pi$.

- If $\pi = 0.8$, Kasparov’s victories $Y$ will tend to be high
- If $\pi = 0.2$, Kasparov’s victories $Y$ will tend to be low

$Y$ *depends upon* or is *conditioned upon* the value of $\pi$:
$$f(y|\pi)=P(Y=y|\pi)$$
# Binomial model
1. the outcome of any one game doesn’t influence the outcome of another
2. Kasparov has an equal probability, $\pi$, of winning any game in the match
This is a common framework in statistical analysis, one which can be represented by the Binomial model.
$$Y|\pi \sim Bin(6,\pi)$$
In the end, Kasparov won only one of the six games against Deep Blue in 1997.
# Binomial model
The **pmf** of a $Bin(6, \pi)$ model is plotted for each possible value of $\pi \in \{0.2,0.5,0.8\}$. The masses marked by the black lines correspond to the eventual observed data, $Y=1$ win.

# Binomial likelihood function
The likelihood function $L(\pi|y=1)$ of observing $Y=1$ win in six games for any win probability $\pi \in \{0.2,0.5,0.8\}$.
\begin{center}
\includegraphics[width=0.6\textwidth]{img/binom-chess-like-1.png}
\end{center}
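
These likelihood values can be checked with base R's `dbinom()`; a minimal sketch:

```{r chess_likelihood, echo=TRUE}
# L(pi | y = 1) = f(y = 1 | pi) for each candidate win probability pi
dbinom(x = 1, size = 6, prob = c(0.2, 0.5, 0.8))
```
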
# Probability mass functions vs likelihood functions
When $\pi$ is known, the conditional **pmf** $f(\cdot|\pi)$ allows us to compare the probabilities of different possible values of data $Y$ occurring with $\pi$:
$$f(y_1|\pi) \text{ vs } f(y_2|\pi)$$
When $Y=y$ is known, the **likelihood function** $L(\cdot|y)=f(y|\cdot)$ allows us to compare the relative likelihood of observing data $y$ under different possible values of $\pi$:
$$ L(\pi_1|y) \text{ vs } L(\pi_2|y)$$
Thus, $L(\cdot|y)$ provides the tool we need to evaluate the relative compatibility of data $Y=y$ with various $\pi$ values.
# Computing the posterior probability of $\pi$ given $y$
We want to balance this prior and likelihood information.
$$f(\pi | y) \propto f(\pi)L(\pi|y)$$
$L(\pi|y)$ is not a probability, so we need to divide by a **normalizing constant**.
# Bayes' rule
Bayes’ Rule requires three pieces of information: the **prior**, the **likelihood**, and a **normalizing constant**.
$$f(\pi | y) = \frac{f(\pi)L(\pi|y)}{\text{normalizing constant}} = \frac{P(\pi \cap y)}{\text{normalizing constant}}$$
$P(\pi \cap y)$ is the joint probability of observing both $\pi$ and $y$.
$$\text{posterior} = \frac{\text{prior} ⋅ \text{likelihood}}{\text{normalizing constant}}$$
# Normalizing constant
We must determine the **total probability** that Kasparov would win $Y=1$ game across all possible win probabilities $\pi$, $f(y=1)$.
$$f(y = 1) = \sum_{\pi \in \{0.2,0.5,0.8\}} L(\pi | y=1) f(\pi)$$
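
A short sketch of this sum, reusing the illustrative `chess_prior` tibble defined earlier:

```{r chess_normalizing, echo=TRUE}
# Likelihood of y = 1 win under each candidate pi
likelihood <- dbinom(x = 1, size = 6, prob = chess_prior$pi)
# Total probability of observing y = 1, i.e. the normalizing constant f(y = 1)
normalizing_constant <- sum(likelihood * chess_prior$f_pi)
normalizing_constant
```
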
# Posterior probability model
$$f(\pi | y=1) = \frac{f(\pi)L(\pi|y=1)}{f(y = 1)} \;\; \text{ for } \pi \in \{0.2,0.5,0.8\}$$
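
Continuing the same illustrative sketch, the posterior table follows directly from Bayes' Rule:

```{r chess_posterior, echo=TRUE}
# Posterior: prior times likelihood, divided by the normalizing constant f(y = 1)
chess_posterior <- chess_prior %>%
  mutate(
    likelihood   = dbinom(x = 1, size = 6, prob = pi),
    f_pi_given_y = f_pi * likelihood / sum(f_pi * likelihood)
  )
chess_posterior
```
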

# Proportionality
$$f(\pi | y) = \frac{f(\pi)L(\pi|y)}{f(y)} \propto f(\pi)L(\pi|y)$$

# Beta-Binomial model
In the previous example, our prior was an over-simplification:

- the possible values of $\pi$ are continuous on $[0,1]$
- we can use the $\text{Beta}$ distribution with parameters $\alpha$ and $\beta$ to model our prior
**Alison Bechdel’s 1985 comic rule:**
only see a movie if it satisfies the following three rules (Bechdel 1986):
- the movie has to have at least two women in it
- these two women talk to each other; and
- they talk about something besides a man.
What percentage of all recent movies do you think pass the Bechdel test? Is it closer to 10%, 50%, 80%, or 100%?
# Three prior models for the proportion of films that pass the Bechdel test.

$$Y | \pi \sim \text{Bin}(n, \pi)$$
$$\pi \sim \text{Beta}(\alpha, \beta)$$
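
The `bayesrules` package loaded in the setup chunk ships helpers for exploring such priors; a minimal sketch, with the Beta(5, 11) parameters chosen purely for illustration rather than taken from the slide's figure:

```{r beta_prior_example, echo=TRUE, fig.height=2.5}
# One possible Beta prior for the proportion pi of films passing the Bechdel test
plot_beta(alpha = 5, beta = 11)
```
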
# Different priors, different posteriors
We sample $n=20$ movies and $Y=9$ (45%) of them pass the Bechdel test.

Likelihood functions are not pdfs; the likelihood function is scaled here only to simplify the visual comparison between the prior and the data’s evidence about $\pi$.
# Different priors, different posteriors
We sample $n=20$ movies and $Y=9$ (45%) of them pass the Bechdel test.

Likelihood functions are not pdfs; the likelihood function is scaled here only to simplify the visual comparison between the prior and the data’s evidence about $\pi$.
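
A hedged sketch of this update with `bayesrules` helpers, again assuming an illustrative Beta(5, 11) prior rather than a specific prior from the figure:

```{r bechdel_update_example, echo=TRUE, fig.height=2.5}
# Prior, (scaled) likelihood, and posterior for y = 9 successes in n = 20 trials
plot_beta_binomial(alpha = 5, beta = 11, y = 9, n = 20)
# Numerical summary of the same Beta-Binomial update
summarize_beta_binomial(alpha = 5, beta = 11, y = 9, n = 20)
```
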
# Different data, different posteriors
- $n=13$ movies from the year 1991, among which $Y=6$ (about 46%) pass the Bechdel test
- $n=63$ movies from 2000, among which $Y=29$ (about 46%) pass the Bechdel test
- $n=99$ movies from 2013, among which $Y=46$ (about 46%) pass the Bechdel test

# Different data, different posteriors
- $n=13$ movies from the year 1991, among which $Y=6$ (about 46%) pass the Bechdel test
- $n=63$ movies from 2000, among which $Y=29$ (about 46%) pass the Bechdel test
- $n=99$ movies from 2013, among which $Y=46$ (about 46%) pass the Bechdel test
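
A sketch of these three updates with `summarize_beta_binomial()`, assuming a common flat Beta(1, 1) prior purely for illustration:

```{r bechdel_years_example, echo=TRUE}
# Same flat prior, increasingly large samples with roughly 46% success
summarize_beta_binomial(alpha = 1, beta = 1, y = 6, n = 13)
summarize_beta_binomial(alpha = 1, beta = 1, y = 29, n = 63)
summarize_beta_binomial(alpha = 1, beta = 1, y = 46, n = 99)
```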

# Different priors, different data, different posteriors
\begin{center}
\includegraphics[width=\textwidth]{img/bechdel-combined-ch4-1.png}
\end{center}
# Bayesian knowledge building
\begin{center}
\includegraphics[width=0.6\textwidth]{img/bayes_diagram.png}
A Bayesian knowledge-building diagram
\end{center}