Jekyll2020-05-23T23:54:13+00:00abelborges.github.io/feed.xmlData interludeTechnical notes on data-related issues and concepts.Abel BorgesOn selection bias and sample calibration2020-04-26T00:00:00+00:002020-04-26T00:00:00+00:00abelborges.github.io/selection-bias<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
<p>When using data from a sample of a population of interest
to make inferences about that population,
it’s important to be aware of the
sampling process, i.e.
how observed units in the sample were chosen.</p>
<p>The goals of this post are:</p>
<ol>
<li>To briefly introduce probability sampling;</li>
<li>To talk about systematic selection biases and how we can reason about them;</li>
<li>To expose the risk of being misled by selection bias
when conclusions are driven by common statistical tools;</li>
<li>Finally, to demonstrate what you can do about them
and help you to understand the assumptions and consequent
limitations of such tools.</li>
</ol>
<p>This post is for anyone who is interested in these issues
and has a basic understanding of probability and statistics.</p>
<p>I’ll discuss these topics in a simple yet realistic setup.
I’ll take advantage of the simplicity to make a deeper analysis
that’s hopefully useful in understanding the technique as
applied in the wild.</p>
<h2 id="the-problem">The problem</h2>
<p>We are interested in a finite population, or universe,
<script type="math/tex">U = \{1,2,\ldots,N\}</script> of <script type="math/tex">N</script> elements, or units,
and only a sample
<script type="math/tex">S \subset U</script> of <script type="math/tex">n</script> elements taken from <script type="math/tex">U</script> is available.
We want to use <script type="math/tex">S</script> to provide a guess, or estimate,
for the populational quantity, or parameter,</p>
<script type="math/tex; mode=display">\begin{equation}
P(C) = \frac{N(C)}{N},
\end{equation}</script>
<p>the proportion of elements of <script type="math/tex">U</script> that belong to a set <script type="math/tex">C \subset U</script>.
We write <script type="math/tex">N(X)</script> and <script type="math/tex">n(X)</script> for the number of elements
of <script type="math/tex">X</script> in <script type="math/tex">U</script> and <script type="math/tex">S</script>, respectively.
You can also think of <script type="math/tex">P(C)</script> as the probability of
sampling a random element from <script type="math/tex">U</script> and then finding out
that it belongs to <script type="math/tex">C</script>.
These two interpretations
(the “share of something” in the population and
the “probability of something” being sampled)
can be invoked interchangeably
throughout the text whenever you see <script type="math/tex">P(\text{something})</script>.
Also, recall the definition of conditional distributions:</p>
<script type="math/tex; mode=display">P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)},</script>
<p>where <script type="math/tex">X</script> and <script type="math/tex">Y \neq \emptyset</script> are subsets of <script type="math/tex">U</script>.</p>
<h2 id="the-population-in-one-figure">The population in one figure</h2>
<p>That situation can be visualized below,
where the whole rectangle is <script type="math/tex">U</script> and
we also add that it can be partitioned into
two disjoint sets <script type="math/tex">A</script> and <script type="math/tex">B = A^c</script>
(just the complement of <script type="math/tex">A</script> in <script type="math/tex">U</script>),
which are going to be useful in a minute.</p>
<p>You can think of <script type="math/tex">A</script> as being any other
characteristic of the elements of <script type="math/tex">U</script>
that may be important to understand those
which belong to <script type="math/tex">C</script>.
Both <script type="math/tex">A</script> and <script type="math/tex">C</script> could be
some thresholding on age, years
of education or gender in case
we are talking about humans.</p>
<div style="text-align:center">
<img src="/images/selection-bias/sets.png" />
</div>
<h2 id="probability-sampling">Probability sampling</h2>
<p>We say that <script type="math/tex">S</script> is a probability sample if
it was obtained by a <strong>controlled (i.e. known) randomization procedure</strong>
that attaches to each element <script type="math/tex">i \in U</script> an inclusion probability
<script type="math/tex">\pi_i</script> of it being chosen to be part of <script type="math/tex">S</script>.</p>
<p>For a set <script type="math/tex">X \subset U</script>, we define the binary variable
<script type="math/tex">X_i</script> as being 1 when <script type="math/tex">i \in X</script> and 0 otherwise.</p>
<p>Note that <script type="math/tex">C_i</script> is just a constant
that can be perfectly known as long as we have access
to the unit <script type="math/tex">i</script>.
On the other hand, we only get to know the
<script type="math/tex">S_i</script> values once we get <script type="math/tex">S</script>,
and repeating the same randomized sampling process
may yield different values for the same <script type="math/tex">i</script>.
If the sampling procedure itself is known,
then what we know a priori is the inclusion probability
<script type="math/tex">P(S_i = 1) = 1 - P(S_i = 0) = \pi_i</script>.
That is, prior to sampling, <script type="math/tex">S_i</script> is a
<a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli random variable</a>.</p>
<p>Now we may write</p>
<script type="math/tex; mode=display">P(C)
= \frac{N(C)}{N}
= \frac{1}{N} \sum_{i \in U} C_i.</script>
<p>Let’s try to employ the sample analogue of the above equation
to estimate <script type="math/tex">P(C)</script> from <script type="math/tex">S</script>
(just because it is pretty intuitive):</p>
<script type="math/tex; mode=display">\widehat{P(C)}
= \frac{1}{n} \sum_{i \in S} C_i.</script>
<p>Note that <script type="math/tex">\widehat{P(C)}</script> is a random variable
because it depends on <script type="math/tex">S</script>,
which prior to sampling we do not know.
Let’s make the dependence on the random process explicit
in terms of the <script type="math/tex">S_i</script> variables
(note the change from <script type="math/tex">S</script> to <script type="math/tex">U</script> in the second summation):</p>
<script type="math/tex; mode=display">\widehat{P(C)}
= \frac{1}{n} \sum_{i \in S} C_i
= \frac{1}{n} \sum_{i \in U} C_i \cdot S_i.</script>
<p>Is this an accurate estimator of <script type="math/tex">P(C)</script>?
Since the <script type="math/tex">C_i</script>s are constants, it’s easy to
compute the expectation of this random variable:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathrm{E} \left\{ \widehat{P(C)} \right\}
&= \mathrm{E} \left\{ \frac{1}{n} \sum_{i \in U} C_i \cdot S_i \right\} \\
&= \frac{1}{n} \sum_{i \in U} C_i \cdot \mathrm{E} \left\{ S_i \right\} \\
&= \frac{1}{n} \sum_{i \in U} C_i \cdot \pi_i.
\end{align} %]]></script>
<p>When all units <script type="math/tex">i</script> are equally likely to be sampled,
we have that <script type="math/tex">\pi_i = n/N</script> for all <script type="math/tex">i</script>.
In this case, <script type="math/tex">S</script> is called a <strong>simple random sample (SRS)</strong>
and the equation collapses into</p>
<script type="math/tex; mode=display">\begin{align}
\mathrm{E} \left\{ \widehat{P(C)} \right\} = \frac{1}{N} \sum_{i \in U} C_i = P(C).
\end{align}</script>
<p>That is, under a SRS,
the sample proportion of cases in <script type="math/tex">C</script> is a nice
guess of its populational counterpart.
When an equation like this holds true, we
say that the estimator is centered for the parameter.</p>
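<p>To see this centering in action, here is a minimal simulation sketch in Python. The numbers are made up for illustration: a population of 10,000 units, 30% of which belong to <script type="math/tex">C</script>. Averaging the sample proportion over many repeated simple random samples recovers the populational proportion.</p>

```python
import random

# Hypothetical population: N = 10,000 units, 30% of which belong to C.
N = 10_000
C = [1] * 3_000 + [0] * 7_000
random.seed(42)
random.shuffle(C)

def srs_estimate(C, n):
    """Sample proportion of C under a simple random sample of size n."""
    sample = random.sample(range(len(C)), n)
    return sum(C[i] for i in sample) / n

# Averaging the estimator over repeated samples approaches P(C) = 0.3,
# illustrating that it is centered under SRS.
estimates = [srs_estimate(C, 100) for _ in range(2_000)]
mean_estimate = sum(estimates) / len(estimates)
```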
<h2 id="the-horvitz-thompson-estimator">The Horvitz-Thompson estimator</h2>
<p>But what if <script type="math/tex">S</script> is not a SRS?
What if the elements in <script type="math/tex">U</script> are not equally likely to be in <script type="math/tex">S</script>?</p>
<p>An important observation is that both
<script type="math/tex">P(C)</script> and <script type="math/tex">\widehat{P(C)}</script> are linear
combinations of the <script type="math/tex">C_i</script>s.
Let’s write it down:</p>
<script type="math/tex; mode=display">\widehat{P(C)} = \sum_{i \in S} k_i \cdot C_i,</script>
<p>where the <script type="math/tex">k_i</script>s are constants.
In the SRS case,
we just saw that <script type="math/tex">k_i = 1/n</script> is a
good choice because <script type="math/tex">\pi_i = n/N</script>.
In general, we observe that the expectation
of this linear estimator is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathrm{E} \left\{ \widehat{P(C)} \right\}
&= \mathrm{E} \left\{ \sum_{i \in U} k_i \cdot C_i \cdot S_i \right\} \\
&= \sum_{i \in U} k_i \cdot C_i \cdot \pi_i.
\end{align} %]]></script>
<p>If we set <script type="math/tex">k_i = (N \pi_i)^{-1}</script>,
then <script type="math/tex">\widehat{P(C)}</script> is centered for <script type="math/tex">P(C)</script>.
This is known as the
<a href="https://en.wikipedia.org/wiki/Horvitz%E2%80%93Thompson_estimator#Proof_of_Horvitz-Thompson_Unbiased_Estimation_of_the_Mean">Horvitz-Thompson estimator (HTE)</a>.
It assumes that we know the randomization process
that yields <script type="math/tex">S</script> and therefore know the inclusion probabilities
<script type="math/tex">\pi_i</script> up to reasonable accuracy for all <script type="math/tex">i</script>.</p>
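<p>A quick Python sketch may help. The setup below is invented: each unit is included independently (a Bernoulli, "Poisson sampling" design), and units in <script type="math/tex">C</script> are twice as likely to be sampled as the others, so the naive sample proportion overshoots <script type="math/tex">P(C) = 0.3</script> while the HTE stays centered.</p>

```python
import random

random.seed(0)
# Hypothetical population where inclusion is correlated with C:
# units in C have inclusion probability 0.02, the rest 0.01.
N = 10_000
C = [1 if i < 3_000 else 0 for i in range(N)]
pi = [0.02 if c == 1 else 0.01 for c in C]  # known inclusion probabilities

def one_draw():
    """One draw: include each unit i independently with probability pi[i]."""
    S = [i for i in range(N) if random.random() < pi[i]]
    naive = sum(C[i] for i in S) / len(S)    # plain sample proportion
    ht = sum(C[i] / (N * pi[i]) for i in S)  # Horvitz-Thompson estimator
    return naive, ht

draws = [one_draw() for _ in range(500)]
naive_mean = sum(d[0] for d in draws) / len(draws)  # biased upward (~0.46)
ht_mean = sum(d[1] for d in draws) / len(draws)     # close to P(C) = 0.3
```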
<p>Notice that
probability sampling designs may have
different inclusion probabilities for different units
in the population
(<a href="https://en.wikipedia.org/wiki/Sampling_(statistics)#Sampling_methods">SRS is not the only probability sampling design</a>).
That is simply a design choice that is commonly employed as a
tool to improve the precision of the estimates under certain circumstances.</p>
<p>But what if we do not know the <script type="math/tex">\pi_i</script>s?
Equivalently,
<strong>what if we do not know the process that yields <script type="math/tex">S</script>?</strong>
Well, welcome to most real-world data analysis problems.</p>
<h2 id="calibration">Calibration</h2>
<p>In practice, we rarely know the structure of
the sampling process.
However, when we do know it, we saw that
the HTE proposes to weight the data
from unit <script type="math/tex">i</script> with <script type="math/tex">(N\pi_i)^{-1}</script>
and this choice produces a centered estimator.
That’s nicely intuitive since we may interpret <script type="math/tex">\pi_i^{-1}</script> as
the number of units in <script type="math/tex">U</script> that unit <script type="math/tex">i</script>
will be responsible for representing if it’s sampled.</p>
<p>The baseline approach is usually the sample proportion,
which we know to be our best guess of <script type="math/tex">P(C)</script>
under SRS.
How can we improve it?</p>
<p>Let’s suppose that knowing that a unit belongs to <script type="math/tex">A</script>
is a relevant piece of information for guessing whether
or not it is also in <script type="math/tex">C</script>.
In other words, <script type="math/tex">A</script> and <script type="math/tex">C</script> are correlated,
so that the distribution <script type="math/tex">P(C \mid A)</script> is different
from <script type="math/tex">P(C)</script>.
We’ll denote this hypothesis by <script type="math/tex">\mathcal{H}</script>.</p>
<p>From the
<a href="https://en.wikipedia.org/wiki/Law_of_total_probability">law of total probability</a>,
we have</p>
<script type="math/tex; mode=display">\begin{align}
P(C) = P(C \mid A) \ P(A) + P(C \mid B) \ P(B).
\end{align}</script>
<p>A similar equation is also true when we condition on the sample as well:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P(C \mid S)
&= P(C \mid A \cap S)\ P(A \mid S) + P(C \mid B \cap S) \ P(B \mid S).
\end{align} %]]></script>
<p>It may be useful to visualize these equations in the figure at the beginning.
Note that <script type="math/tex">P(C \mid S)</script> is just the baseline estimate,
the fraction of <script type="math/tex">C</script> in <script type="math/tex">S</script> once <script type="math/tex">S</script> is realized.
Also, it’s informative to evaluate
this last equation for the SRS case,
just to convince ourselves that it makes sense:</p>
<script type="math/tex; mode=display">\widehat{P_{\text{SRS}}(C)} =
\frac{n(C \cap A)}{n(A)} \frac{n(A)}{n} +
\frac{n(C \cap B)}{n(B)} \frac{n(B)}{n} =
\frac{n(C)}{n},</script>
<p>which is right! We’ve used the fact that <script type="math/tex">A</script> and <script type="math/tex">B</script>
form a partition of <script type="math/tex">U</script>.
Note that, by dropping the conditioning
on <script type="math/tex">S</script>, we are back to a random variable, the estimator itself.</p>
<p>What I want to point out now is how we may improve
this estimator induced by <script type="math/tex">P(C \mid S)</script> given that we
do not know the structure of the sampling process but
<script type="math/tex">\mathcal{H}</script> is true and we have access to
auxiliary information on <script type="math/tex">P(A)</script> in the form
of an alternative estimate <script type="math/tex">\widehat{P(A)}</script>.</p>
<p>The (observed) bias of our baseline estimator
is <script type="math/tex">P(C \mid S) - P(C)</script>.
Hopefully, it’s clear that it has two sources:</p>
<ol>
<li><script type="math/tex">P(A \mid S) \neq P(A)</script>:
Under <script type="math/tex">\mathcal{H}</script>,
over- or under-representation of <script type="math/tex">A</script> in <script type="math/tex">S</script>
can be a problem since we want to estimate <script type="math/tex">P(C)</script>;</li>
<li><script type="math/tex">P(C \mid A \cap S) \neq P(C \mid A)</script> (similarly for <script type="math/tex">B</script>):
Even when the previous point is OK, we can still suffer if there is
selection bias within <script type="math/tex">A</script> and <script type="math/tex">B</script> with respect to <script type="math/tex">C</script>.</li>
</ol>
<p>It’s a good exercise to verify that both inequalities above
turn into equalities under SRS.
It’s also important to notice that these can cause problems
<em>when we use the baseline fraction-of-<script type="math/tex">C</script>-in-<script type="math/tex">S</script> estimator</em>
and these issues are present.
In general, these conditions can be met by probability sampling designs and
that would not be a problem because we would be aware of it
and could employ, say, the HTE.
We want to analyze the disparities between a naive
guess and possible underlying populational realities.</p>
<p>The idea is simply to substitute <script type="math/tex">P(A \mid S)</script> by
the auxiliary estimate <script type="math/tex">\widehat{P(A)}</script>,
with <script type="math/tex">\widehat{P(B)} = 1 - \widehat{P(A)}</script>,
to obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\widehat{P_\text{calib}(C)}
&= \widehat{P(A)} \frac{n(C \cap A)}{n(A)} +
\widehat{P(B)} \frac{n(C \cap B)}{n(B)} \\
&= \sum_{i \in S} \left(
\frac{\widehat{P(A)} \cdot A_i \cdot C_i}{n(A)} +
\frac{\widehat{P(B)} \cdot B_i \cdot C_i}{n(B)}
\right).
\end{align} %]]></script>
<p>This technique is known as <strong>calibration</strong>.
As discussed, it’s expected to work as well as
the sample approximations of the conditional
distributions <script type="math/tex">P(C \mid A)</script> and <script type="math/tex">P(C \mid B)</script>.</p>
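<p>The calibration estimator can be coded directly from its first expression above. The numbers here are hypothetical: <script type="math/tex">A</script> is over-represented in the sample (60% versus a known population share of 40%), and <script type="math/tex">C</script> is much more common within <script type="math/tex">A</script> than within <script type="math/tex">B</script>.</p>

```python
def calibrated_estimate(sample, p_A_hat):
    """Calibration estimate of P(C) from a list of (A_i, C_i) 0/1 pairs,
    given an auxiliary estimate p_A_hat of P(A)."""
    n_A = sum(a for a, _ in sample)
    n_B = len(sample) - n_A
    n_CA = sum(c for a, c in sample if a == 1)  # n(C and A)
    n_CB = sum(c for a, c in sample if a == 0)  # n(C and B)
    # Within-sample estimates of P(C|A) and P(C|B), reweighted by the
    # external estimates of the population shares of A and B.
    return p_A_hat * n_CA / n_A + (1 - p_A_hat) * n_CB / n_B

# Hypothetical sample of size 100: 60 units in A (30 of them in C)
# and 40 units in B (4 of them in C).
sample = [(1, 1)] * 30 + [(1, 0)] * 30 + [(0, 1)] * 4 + [(0, 0)] * 36
naive = sum(c for _, c in sample) / len(sample)        # 0.34
calibrated = calibrated_estimate(sample, p_A_hat=0.4)  # 0.4*0.5 + 0.6*0.1 = 0.26
```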
<p>This estimator is also linear in <script type="math/tex">C_i</script>
and the HTE coefficient <script type="math/tex">k_i = (N\pi_i)^{-1}</script>
is being <em>estimated</em> as
<script type="math/tex">\widehat{P(A)}/{n(A)}</script> if <script type="math/tex">i \in A</script>,
and <script type="math/tex">\widehat{P(B)}/{n(B)}</script> otherwise.
Also, whenever <script type="math/tex">\widehat{P(A)}</script> is a good approximation,
i.e. <script type="math/tex">\widehat{P(A)} \approx P(A) = N(A)/N</script>,
we get</p>
<script type="math/tex; mode=display">\frac{\widehat{P(A)}}{n(A)} = \left( N \cdot \frac{n(A)}{N(A)} \right)^{-1},</script>
<p>and thus we are estimating the inclusion probability
<script type="math/tex">\pi_i</script> as the probability <script type="math/tex">n(A)/N(A)</script> that
a random unit from <script type="math/tex">A</script> will be found to be in <script type="math/tex">S</script>
(take a second to think about it).</p>
<p>That’s how our assumptions plus this calibration thing
are effectively modeling the sampling process.</p>
<p>It seems reasonable to argue that
if we have no prior information on the <script type="math/tex">\pi_i</script>s,
our expectation on the statistical performance of
<script type="math/tex">\widehat{P_\text{calib}(C)}</script> is always equal to or better than
that of <script type="math/tex">\widehat{P_\text{SRS}(C)}</script>
whenever <script type="math/tex">\widehat{P(A)}</script> is more accurate than <script type="math/tex">n(A)/n</script>.</p>
<h2 id="sum-up">Sum up</h2>
<p>We’ve learned that</p>
<ul>
<li>Under SRS, the commonly used
sample proportion is a good guess for the populational proportion</li>
<li>When that is not the case but we know the sampling process,
the general form of the proportion estimator is given by
the Horvitz-Thompson estimator, in which each unit <script type="math/tex">i</script>
is assigned a weight proportional to <script type="math/tex">\pi_i^{-1}</script>;</li>
<li>When we do not know the sampling process but we have access
to information on auxiliary variables that correlate with
the variable we want to understand, we can use this extra info
to calibrate our naive sample proportion and expect its
performance to improve somewhat, depending on how strong
that correlation is.</li>
</ul>
<p>In a future post, I hope to show the technique in action with real data.</p>
<p>Thanks for reading.</p>Abel BorgesWhen using data from a sample of a population of interest to make inferences about that population, it's important to be aware of the sampling process, i.e. how observed units in the sample were chosen. The goals of this post are to briefly introduce probability sampling, and to talk about systematic selection biases, how we can reason about them and how we can avoid them in a simple yet realistic scenario.Intro to hypothesis testing2019-07-10T00:00:00+00:002019-07-10T00:00:00+00:00abelborges.github.io/hypothesis-testing<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
<p>In this post, we walk through a toy problem
to talk about the anatomy and design of
a statistical test of hypothesis.
We will see</p>
<ul>
<li>What is a p-value and how to compute it</li>
<li>What are False Positive rates and why it’s important to know them well</li>
<li>How to fix a conservative bias in the decisions based on the test results</li>
<li>Limitations of statistical tests</li>
</ul>
<h2 id="toy-problem">Toy problem</h2>
<p>Would you believe it if someone told you that 1
out of 10 coin tosses came up as heads?
A general approach to <strong>decide</strong> is to answer “what are the odds?!”.</p>
<p>The simplest ways to do that include making
assumptions about how the world works
and then computing the likelihood
of the observed reality happening
in a world like that.</p>
<p>Let’s assume that each coin flip
is equally likely to turn out heads or tails,
independently of previous coin flips.
In other words, the coin is unbiased.</p>
<p><strong>If that is true</strong>,
then the number of heads in a stream of
10 coin tosses follows a
<a href="https://en.wikipedia.org/wiki/Binomial_distribution#Probability_mass_function">binomial distribution</a>
with parameters 10 and 1/2
and we would expect that about half
(that is, 5) of the coin
flips would turn out heads.
Below are the probabilities of observing
each possible “heads” count:</p>
<p><img src="/images/hypothesis-testing/heads-count.png" alt="" /></p>
<p>The probability of observing exactly what our
friend said (1 head) would be about <strong>0.98%</strong>.</p>
<p>More importantly,
we can compute the <strong>tail probability</strong>
of events <strong>at least as extreme</strong> as the count of 1,
that is, 0, 1, 9 or 10 heads,
since they are all at least as far from
the expected result of 5 as 1 is.
Equivalently, they are each at most as likely an outcome as 1.
The result is approximately <strong>2.15%</strong>.</p>
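<p>Both numbers can be checked with a few lines of Python, using the binomial probability mass function:</p>

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """P(exactly k heads in n tosses of a coin with heads probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 1 head in 10 fair tosses: ~0.98%.
p_exact = binom_pmf(1, 10)
# Tail probability of outcomes at least as extreme (0, 1, 9 or 10 heads): ~2.15%.
p_tail = sum(binom_pmf(k, 10) for k in (0, 1, 9, 10))
```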
<p>You can use that number to make a call.
Is 2.15% low enough for us to disbelieve
that heads and tails are equally likely?
Does it make the phrase <em>“1 out of 10 were heads”</em> sound nonsense?
The decision is up to you.</p>
<p>You may say a number of valid things, like:</p>
<ul>
<li>“Since 2.15% is greater than 1%, I think the coin is indeed unbiased.”</li>
<li>“That’s too unlikely! If he said the heads
count was between 3 and 7 I would believe it.”</li>
<li>“10 samples are not enough. I need more data to decide.”</li>
</ul>
<p>In any case, there is a subset of outcomes
for which we would reject the idea that
the coin is unbiased.</p>
<h2 id="naming-stuff">Naming stuff</h2>
<p>Well, we have pretty much done it!
We have just built a statistical test.
It required</p>
<ol>
<li>Hypothesizing a model of <strong>how the world works</strong>:
The outcomes of coin flips are independent and equally likely</li>
<li>Evaluating the odds of events at least as extreme
as the <strong>observed reality</strong> happening if the model is true:
The probability of observing heads once out of 10 coin tosses
was evaluated to 0.98% and the probability of observing events
like this or even more extreme (with respect to the initial hypothesis)
was evaluated to 2.15%.</li>
<li>Making a <strong>call</strong> based on the tail probability
according to a pre-specified rule.</li>
</ol>
<p>The assumptions we make about how the world works
(“the coin is unbiased”),
upon which we compute the probability of interest,
compose what is usually called the <strong>null hypothesis</strong>.
The tail probability (2.15%) is called the <strong>p-value</strong>.</p>
<p>The function of data that we use
to compute the p-value (number of heads)
is called the <strong>test statistic</strong>.
It has a distribution under the null hypothesis
(binomial, with parameters 10 and 1/2)
sometimes called the <strong>null distribution</strong>.
The set of outcomes that would make us
doubt the null hypothesis in such a way that
we reject it as a plausible description of
reality
(for instance, “the p-value is smaller than 1%”
or “the distance of the heads count to 5 is greater than 2”)
is called the <strong>critical region</strong>.</p>
<p>When making a decision informed by the p-value,
we may be wrong in two ways:</p>
<ol>
<li>If we decide the coin is biased, when actually it’s not; and</li>
<li>If we decide the coin is unbiased, when in fact it is biased.</li>
</ol>
<p>The first one is called a <strong>False Positive (FP)</strong>.
The second one is a <strong>False Negative (FN)</strong>.
The null hypothesis is usually set up
as the hypothetical world where</p>
<ul>
<li>“everything is OK”, or</li>
<li>“nothing has changed”, or</li>
<li>“you are healthy”, or</li>
<li>“he is innocent”, or</li>
<li>“there’re no differences between these groups of people”, or</li>
<li>“there’s no enemy’s plane entering our borders”,</li>
</ul>
<p>and so on. Therefore, the term “Positive” in “False Positive”
refers to the detection of “something”.</p>
<p>Controlling the rates of FPs and FNs
is a central task in the design of a statistical test.
Those rates pose a trade-off to be solved by the designer:</p>
<ul>
<li><strong>I don’t want to miss anything!</strong>
Being too afraid of failing to catch any deviations
from the null hypothesis leads to setting very small FN rates.
That may, in turn, cause an increase in FPs, since we may shout out
“I saw something!!!” for almost anything.</li>
<li><strong>I hate interruptions!</strong>
In the context of quality control of industrial products,
we may want to avoid stopping the machines unless
something really critical is detected,
and then very small FP rates will be preferred.
Accordingly, that may imply increased FN rates
since we are going to say “everything is just fine” and keep moving
more often than we should.</li>
</ul>
<h2 id="how-sure-are-you-about-that-uncertainty-estimate">How sure are you about that uncertainty estimate?</h2>
<p>Hypothesis testing is all about decision making.
Statistical tests are probabilistic decision-informing machines.
Different situations induce different levels
of rigor on the required accuracy of p-values.</p>
<p>The greater the risk around the decision to be made,
the greater the consequences of mishandling statistics.</p>
<p>An important property of a test is its
<a href="https://en.wikipedia.org/wiki/Power_\(statistics\)"><strong>power</strong></a>.
The power of a statistical test is
the probability of rejecting the null hypothesis
given that some alternative hypothesis is actually the truth.</p>
<p>Let’s define the critical region
for our coin-flipping problem as</p>
<blockquote>
<p>The set of outcomes for which
the p-value is less than 5%.</p>
</blockquote>
<p>That means we have fixed the FP rate at 5%:
we expect to be wrong only 5%
of the time when rejecting
(the “I see something!!!” part)
the null hypothesis.</p>
<p>Once we fix the FP rate, we can simulate coin tosses
and estimate the probability of rejecting the
null hypothesis by using our test
for various combinations of the number of coin tosses
and the hypothetical “heads” probability.</p>
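<p>A sketch of that simulation in Python, assuming the two-sided p-value used throughout (the total probability of outcomes no more likely than the observed one) and the “reject when p-value < 5%” rule:</p>

```python
import random
from math import comb

def p_value(heads, n):
    """Two-sided tail probability under the fair-coin null: total mass of
    outcomes no more likely than the observed heads count."""
    pmf = [comb(n, k) * 0.5**n for k in range(n + 1)]
    return sum(p for p in pmf if p <= pmf[heads])

def power(true_p, n, n_sim=2_000, alpha=0.05, seed=1):
    """Estimated probability of rejecting the null when heads occur w.p. true_p."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        heads = sum(rng.random() < true_p for _ in range(n))
        rejections += p_value(heads, n) < alpha
    return rejections / n_sim

power_far = power(0.9, n=100)  # far from the null: rejection almost certain
fp_rate = power(0.5, n=10)     # at the null: the (conservative) actual FP rate
```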
<p><img src="/images/hypothesis-testing/power-function.png" alt="" /></p>
<p>We can draw some conclusions from the <strong>power function</strong> above:</p>
<ul>
<li>The power grows as the true p gets more and more
distant from the p induced by the null hypothesis (50%)</li>
<li>Detecting deviations from the null hypothesis
also becomes easier with more data from the same process.
For instance, if the true p is between 40% and 60%,
observing 100 instead of 10 coin flips lifts our rate
of correct detections from 5% to 50%.</li>
<li>The threshold distance from 50% for which
the power reaches 100% is a function
of the number of coin tosses.
From the image, we see that this threshold is approximately
35% and 25% for 50 and 100 coin flips, respectively.</li>
<li>The black line marks our pre-set, desired 5% FP rate.
We see that our test procedure is
<em>systematically conservative</em> for small samples
in the sense that the actual FP rate is always <em>below</em> the
configured 5% rate; it gets closer to 5% as the
number of coin tosses grows.</li>
</ul>
<h3 id="what-is-the-reason-for-the-conservative-bias">What is the reason for the conservative bias?</h3>
<p>The probability of events as extreme
as 2 heads in 10 coin tosses
(that is, observing 0, 1, 2, 8, 9 or 10 heads)
is 10.94%.
We saw that this number is 2.15% for events as extreme as 1.
The 5% FP rate we have pre-set is in-between these two numbers.
The effect of the rule</p>
<blockquote>
<p>Reject when p-value < 5%</p>
</blockquote>
<p>under a
<a href="https://en.wikipedia.org/wiki/List_of_probability_distributions#Discrete_distributions">discrete distribution</a>
is that our actual FP rate can be much smaller than 5%.
That means less-than-specified, conservative FP rates.</p>
<p>In our case, the sample size impacts the gap
because it is an explicit parameter
of the binomial distribution.
The gap becomes negligible as the number of coin tosses
grows, as we can see below.
It remains consistently below 5 percentage points once
the sample size exceeds 37;
that threshold rises to 624 if we want
to guarantee a gap of less than 1 percentage point.
<p><img src="/images/hypothesis-testing/tail-prob-gap.png" alt="" /></p>
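<p>Since the null distribution is fully known, the actual FP rate of the “reject when p-value < 5%” rule, and hence its distance from the nominal rate, can be computed exactly rather than simulated. A Python sketch:</p>

```python
from math import comb

def actual_fp_rate(n, alpha=0.05):
    """Exact FP rate of 'reject when p-value < alpha' for n fair coin tosses."""
    pmf = [comb(n, k) * 0.5**n for k in range(n + 1)]
    def p_value(k):
        # Total mass of outcomes no more likely than k.
        return sum(p for p in pmf if p <= pmf[k])
    return sum(pmf[k] for k in range(n + 1) if p_value(k) < alpha)

# The actual rate sits below the nominal 5%, and its distance from
# the nominal rate shrinks as the number of tosses grows.
gap_10 = 0.05 - actual_fp_rate(10)    # ~0.0285
gap_100 = 0.05 - actual_fp_rate(100)
```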
<h3 id="can-we-do-something-about-it">Can we do something about it?</h3>
<p>We saw that by bounding the gap on tail probabilities
of a binomial distribution we bound the conservative
bias of our test.
But, until now, for our case,
the only solution is to increase the sample size.
What if we can not gather more data?
We want a test which always obeys the pre-set
FP rate.</p>
<p>We can think about the FP rate as the
<a href="https://en.wikipedia.org/wiki/Expected_value#Finite_case">expected value</a>
of a <strong>decision function</strong>.
We’ve been using a function like that
without realizing it because it’s trivial:</p>
<ul>
<li>Decision is 1 when p-value is less than 5%;</li>
<li>Otherwise, decision is 0.</li>
</ul>
<p>The (actual) FP rate is the expected value of this function:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
1 \cdot P(\text{p-value} < 5\%) +
0 \cdot P(\text{p-value} \geq 5\%) =
P(\text{p-value} < 5\%).
\end{equation} %]]></script>
<p>Now, it’s easier to see the problem:
<script type="math/tex">% <![CDATA[
P(\text{p-value} < 5\%) %]]></script> doesn’t necessarily
yield 5% for discrete distributions as it was intended
because of the gap problem we’ve discussed.
In order to fix that, the idea is to randomize
the decision function
(w.p. is short for “with probability”):</p>
<ul>
<li>If p-value < 5%, decision is 1 w.p. <script type="math/tex">\gamma</script> and 0 w.p. <script type="math/tex">1 - \gamma</script>;</li>
<li>Otherwise, decision is 1 w.p. <script type="math/tex">1 - \gamma</script> and 0 w.p. <script type="math/tex">\gamma</script>.</li>
</ul>
<p>The constant <script type="math/tex">\gamma</script> is just a number
chosen so that the actual FP rate, the expectation</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\gamma \cdot P(\text{p-value} < 5\%) +
(1 - \gamma) \cdot \left[1 - P(\text{p-value} < 5\%)\right],
\end{equation} %]]></script>
<p>equals the pre-specified FP rate, say <script type="math/tex">\alpha</script>. Solving it for <script type="math/tex">\gamma</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\gamma = \frac{1 - \alpha - P(\text{p-value} < 5\%)}{1 - 2 P(\text{p-value} < 5\%)}.
\end{equation} %]]></script>
<p>In our case, for 10 coin tosses,
the probability of hitting the threshold rule
“p-value < 5%” is 2.15%, which corresponds to
observing 0, 1, 9 or 10 heads.
Plugging that into the formula above, we obtain
<script type="math/tex">\gamma = 97.02\%</script>.
Therefore, when we think we “saw something” (p-value < 5%),
we decide to reject the null hypothesis only 97.02%
of the time.
We “flip a coin” with probability <script type="math/tex">\gamma</script> to decide.
Randomizing a decision may seem odd,
but our expectation is that acting
consistently according to this new rule,
we will have better control over the FP rate.</p>
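<p>A quick numerical check of the formula for <script type="math/tex">\gamma</script> and of the resulting FP rate:</p>

```python
def randomization_gamma(p_reject, alpha=0.05):
    """gamma making the randomized rule attain the FP rate alpha exactly,
    where p_reject = P(p-value < alpha) under the null."""
    return (1 - alpha - p_reject) / (1 - 2 * p_reject)

# For 10 fair coin tosses, P(p-value < 5%) = 22/1024 ~ 2.15%.
p = 22 / 1024
gamma = randomization_gamma(p)          # ~0.9702
# The randomized rule's expected rejection rate under the null is alpha.
fp = gamma * p + (1 - gamma) * (1 - p)  # = 0.05
```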
<p>That is in fact what happens in the long run.
Below are the power functions of both tests, zoomed into
the region around p = 1/2.
We see that the actual FP rate now matches
the pre-set FP rate in the randomized test, as intended,
even for small sample sizes.</p>
<p><img src="/images/hypothesis-testing/power-functions.png" alt="" /></p>
<h2 id="do-you-trust-it-for-your-life">Do you trust it for your life?</h2>
<h3 id="skin-in-the-game-type-of-example">Skin-in-the-game type of example</h3>
<p>Real examples in more important situations
are useful for appreciating
the relationship between the p-value and risk,
and the importance of also having reliable
estimates of power, besides accurate p-values.</p>
<p>Let’s say you want to do a surgery for changing your appearance.
Not having the surgery does not kill you. It is just a “plus”.
However, the doctor says that 1% of people die in the surgery table.
Would you do it?</p>
<p>That answer can vary a lot amongst different people.
The question of whether that
1% represents a reliable estimate of the
probability of death for you in particular
may be seen as too skeptical in trivial situations
but probably we all agree that it’s reasonable here.</p>
<p>In common practical scenarios
(e.g. health, finance, or business data analysis),
we miss important properties which invalidate most
naive, textbook analyses.</p>
<h3 id="what-is-it-for">What is it for?</h3>
<p>In our coin tossing example,
we could come up with different models
to make sense of the frequency of heads
motivated by valid questions such as</p>
<ul>
<li>What if coin tosses are not independent from each other?</li>
<li>What if our friend doesn’t tell the truth about
the observed number of tails?</li>
</ul>
<p>Answering that may look silly for coin tossing
but similar skepticism is an important fuel to
more serious research endeavors.
That is a topic worth a post of its own.</p>
<p>Statistical machinery to model reality and then
evaluate p-values accordingly can
become as complicated as the problem asks for
(or as skeptical as you are).
The rigor is proportional
to how critical it is to get it right,
as with everything in life.</p>
<p>The important thing to have in mind is that</p>
<blockquote>
<p>we don’t discover truths using these procedures,
we identify lies.</p>
</blockquote>
<p>If we have no evidence to reject the null hypothesis,
we can’t be sure that it’s the ultimate truth,
since that result can be explained by a multitude of
hidden factors neither contained nor explained by the null hypothesis.
But very small p-values do indicate that the observed data are <em>very unlikely</em>
under the (strong) claim made by the
null hypothesis (“the world is like this”).
<a href="https://en.wikipedia.org/wiki/Philosophy_of_science#Philosophy_of_statistics">Proving right is hard; detecting lies is easier</a>.</p>
<h2 id="sum-up">Sum up</h2>
<ul>
<li>The basic anatomy of a statistical test:
hypothesis (model how the world may work),
p-value evaluation (is it plausible to observe this in a world like that?),
decide (or iterate);</li>
<li>It’s impossible to get it right every time: the FPs versus FNs trade-off;</li>
<li>Power estimates are as important as p-values
in a way similar to
<a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence intervals</a>
being as important
as <a href="https://en.wikipedia.org/wiki/Point_estimation">point estimates</a>;</li>
<li>Different problems require different levels of rigor;</li>
<li>Cultivating healthy skepticism may save your life.</li>
</ul>
<p>Thanks for reading!</p>Abel BorgesIn this post, we walk through a toy problem to talk about the anatomy and design of a statistical test of hypothesis. We will see (1) what is a p-value and how to compute it, (2) what are False Positive rates and why it's important to know them well, (3) how to fix a conservative bias in the decisions based on the test results, and (4) discuss limitations of statistical tests.