Data interlude · abelborges.github.io/feed.xml · Technical notes on data-related issues and concepts · generated by Jekyll, 2020-05-23T23:54:13+00:00

On selection bias and sample calibration · Abel Borges · 2020-04-26 · abelborges.github.io/selection-bias

<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <p>When using data from a sample of a population of interest to make inferences about that population, it’s important to be aware of the sampling process, i.e. how observed units in the sample were chosen.</p> <p>The goals of this post are:</p> <ol> <li>To briefly introduce probability sampling;</li> <li>To talk about systematic selection biases and how we can reason about them;</li> <li>To expose the risk of being misled by selection bias when conclusions are driven by common statistical tools;</li> <li>Finally, to demonstrate what you can do about them and help you understand the assumptions and consequent limitations of such tools.</li> </ol> <p>This post is for anyone who is interested in these issues and has a basic understanding of probability and statistics.</p> <p>I’ll discuss these topics in a simple yet realistic setup. I’ll take advantage of the simplicity to make a deeper analysis that’s hopefully useful in understanding the technique as applied in the wild.</p> <h2 id="the-problem">The problem</h2> <p>We are interested in a finite population, or universe, <script type="math/tex">U = \{1,2,\ldots,N\}</script> of <script type="math/tex">N</script> elements, or units, and only a sample <script type="math/tex">S \subset U</script> of <script type="math/tex">n</script> elements taken from <script type="math/tex">U</script> is available.
We want to use <script type="math/tex">S</script> to provide a guess, or estimate, for the populational quantity, or parameter,</p> <script type="math/tex; mode=display">\begin{equation} P(C) = \frac{N(C)}{N}, \end{equation}</script> <p>the proportion of elements of <script type="math/tex">U</script> that belong to a set <script type="math/tex">C \subset U</script>. We write <script type="math/tex">N(X)</script> and <script type="math/tex">n(X)</script> for the number of elements of <script type="math/tex">X</script> in <script type="math/tex">U</script> and <script type="math/tex">S</script>, respectively. You can also think of <script type="math/tex">P(C)</script> as the probability of sampling a random element from <script type="math/tex">U</script> and then finding out that it belongs to <script type="math/tex">C</script>. These two interpretations (the “share of something” in the population and the “probability of something” being sampled) can be invoked interchangeably throughout the text whenever you see <script type="math/tex">P(\text{something})</script>.
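</p> <p>To make the two readings concrete, here is a small numerical sketch (my own toy example; the population size and the 30% membership rate are invented for illustration) checking that the share of <script type="math/tex">C</script> in the population matches the probability that a uniformly drawn unit belongs to <script type="math/tex">C</script>:</p>

```python
import random

random.seed(42)

# Toy universe U = {0, ..., N-1}; C-membership is a fixed 0/1 label per unit.
N = 10_000
C = [1 if random.random() < 0.3 else 0 for _ in range(N)]

# Reading 1: P(C) as the share of units that belong to C.
share = sum(C) / N

# Reading 2: P(C) as the probability that a uniformly sampled unit is in C,
# approximated by repeated draws.
draws = 100_000
prob = sum(C[random.randrange(N)] for _ in range(draws)) / draws

print(share, prob)  # both close to 0.3
```

<p>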
Also, recall the definition of conditional distributions:</p> <script type="math/tex; mode=display">P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)},</script> <p>where <script type="math/tex">X</script> and <script type="math/tex">Y \neq \emptyset</script> are subsets of <script type="math/tex">U</script>.</p> <h2 id="the-population-in-one-figure">The population in one figure</h2> <p>That situation can be visualized below, where the whole rectangle is <script type="math/tex">U</script> and we also add that it can be partitioned into two disjoint sets <script type="math/tex">A</script> and <script type="math/tex">B = A^c</script> (just the complement of <script type="math/tex">A</script> in <script type="math/tex">U</script>), which are going to be useful in a minute.</p> <p>You can think of <script type="math/tex">A</script> as being any other characteristic of the elements of <script type="math/tex">U</script> that may be important to understand those which belong to <script type="math/tex">C</script>. Both <script type="math/tex">A</script> and <script type="math/tex">C</script> could be some thresholding on age, years of education or gender in case we are talking about humans.</p> <div style="text-align:center"> <img src="/images/selection-bias/sets.png" /> </div> <h2 id="probability-sampling">Probability sampling</h2> <p>We say that <script type="math/tex">S</script> is a probability sample if it was obtained by a <strong>controlled (i.e. 
known) randomization procedure</strong> that attaches to each element <script type="math/tex">i \in U</script> an inclusion probability <script type="math/tex">\pi_i</script> of it being chosen to be part of <script type="math/tex">S</script>.</p> <p>For a set <script type="math/tex">X \subset U</script>, we define the binary variable <script type="math/tex">X_i</script> as being 1 when <script type="math/tex">i \in X</script> and 0 otherwise.</p> <p>Note that <script type="math/tex">C_i</script> is just a constant that can be perfectly known as long as we have access to the unit <script type="math/tex">i</script>. On the other hand, we only get to know the <script type="math/tex">S_i</script> values once we get <script type="math/tex">S</script>, and repeating the same randomized sampling process may yield different values for the same <script type="math/tex">i</script>. If the sampling procedure itself is known, then what we know a priori is the inclusion probability <script type="math/tex">P(S_i = 1) = 1 - P(S_i = 0) = \pi_i</script>. That is, prior to sampling, <script type="math/tex">S_i</script> is a <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli random variable</a>.</p> <p>Now we may write</p> <script type="math/tex; mode=display">P(C) = \frac{N(C)}{N} = \frac{1}{N} \sum_{i \in U} C_i.</script> <p>Let’s try to employ the sample analogue of the above equation to estimate <script type="math/tex">P(C)</script> from <script type="math/tex">S</script> (just because it is pretty intuitive):</p> <script type="math/tex; mode=display">\widehat{P(C)} = \frac{1}{n} \sum_{i \in S} C_i.</script> <p>Note that <script type="math/tex">\widehat{P(C)}</script> is a random variable because it depends on <script type="math/tex">S</script>, which prior to sampling we do not know. 
Let’s make the dependence on the random process explicit in terms of the <script type="math/tex">S_i</script> variables (note the change from <script type="math/tex">S</script> to <script type="math/tex">U</script> in the second summation):</p> <script type="math/tex; mode=display">\widehat{P(C)} = \frac{1}{n} \sum_{i \in S} C_i = \frac{1}{n} \sum_{i \in U} C_i \cdot S_i.</script> <p>Is this an accurate estimator of <script type="math/tex">P(C)</script>? Since the <script type="math/tex">C_i</script>s are constants, it’s easy to compute the expectation of this random variable:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathrm{E} \left\{ \widehat{P(C)} \right\} &= \mathrm{E} \left\{ \frac{1}{n} \sum_{i \in U} C_i \cdot S_i \right\} \\ &= \frac{1}{n} \sum_{i \in U} C_i \cdot \mathrm{E} \left\{ S_i \right\} \\ &= \frac{1}{n} \sum_{i \in U} C_i \cdot \pi_i. \end{align} %]]></script> <p>When all units <script type="math/tex">i</script> are equally likely to be sampled, we have that <script type="math/tex">\pi_i = n/N</script> for all <script type="math/tex">i</script>. In this case, <script type="math/tex">S</script> is called a <strong>simple random sample (SRS)</strong> and the equation collapses into</p> <script type="math/tex; mode=display">\begin{align} \mathrm{E} \left\{ \widehat{P(C)} \right\} = \frac{1}{N} \sum_{i \in U} C_i = P(C). \end{align}</script> <p>That is, under a SRS, the sample proportion of cases in <script type="math/tex">C</script> is a nice guess of its populational counterpart. When an equation like this holds true, we say that the estimator is centered for the parameter.</p> <h2 id="the-horvitz-thompson-estimator">The Horvitz-Thompson estimator</h2> <p>But what if <script type="math/tex">S</script> is not a SRS? 
What if the elements in <script type="math/tex">U</script> are not equally likely to be in <script type="math/tex">S</script>?</p> <p>An important observation is that both <script type="math/tex">P(C)</script> and <script type="math/tex">\widehat{P(C)}</script> are linear combinations of the <script type="math/tex">C_i</script>s. Let’s write it down:</p> <script type="math/tex; mode=display">\widehat{P(C)} = \sum_{i \in S} k_i \cdot C_i,</script> <p>where the <script type="math/tex">k_i</script>s are constants. In the SRS case, we just saw that <script type="math/tex">k_i = 1/n</script> is a good choice because <script type="math/tex">\pi_i = n/N</script>. In general, we observe that the expectation of this linear estimator is</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathrm{E} \left\{ \widehat{P(C)} \right\} &= \mathrm{E} \left\{ \sum_{i \in U} k_i \cdot C_i \cdot S_i \right\} \\ &= \sum_{i \in U} k_i \cdot C_i \cdot \pi_i. \end{align} %]]></script> <p>If we set <script type="math/tex">k_i = (N \pi_i)^{-1}</script>, then <script type="math/tex">\widehat{P(C)}</script> is centered for <script type="math/tex">P(C)</script>. This is known as the <a href="https://en.wikipedia.org/wiki/Horvitz%E2%80%93Thompson_estimator#Proof_of_Horvitz-Thompson_Unbiased_Estimation_of_the_Mean">Horvitz-Thompson estimator (HTE)</a>. It assumes that we know the randomization process that yields <script type="math/tex">S</script> and therefore know the inclusion probabilities <script type="math/tex">\pi_i</script> up to reasonable accuracy for all <script type="math/tex">i</script>.</p> <p>Notice that probability sampling designs may have different inclusion probabilities for different units in the population (<a href="https://en.wikipedia.org/wiki/Sampling_(statistics)#Sampling_methods">SRS is not the only probability sampling design</a>). 
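</p> <p>To see the weighting in action, here is a small simulation (my own sketch; the population, the unequal inclusion probabilities, and the use of Poisson sampling are invented for illustration) comparing the plain sample proportion with the Horvitz-Thompson estimator when units in <script type="math/tex">C</script> are more likely to be sampled:</p>

```python
import random

random.seed(1)

# Synthetic population: units in C are over-sampled, which biases
# the naive fraction-of-C-in-S estimator upwards.
N = 20_000
C = [1 if i < 6_000 else 0 for i in range(N)]    # P(C) = 0.3
pi = [0.10 if c else 0.02 for c in C]            # known inclusion probabilities

reps, naive_sum, ht_sum = 200, 0.0, 0.0
for _ in range(reps):
    # Poisson sampling: unit i enters S independently with probability pi[i].
    S = [i for i in range(N) if random.random() < pi[i]]
    naive_sum += sum(C[i] for i in S) / len(S)          # plain sample proportion
    ht_sum += sum(1 / (N * pi[i]) for i in S if C[i])   # Horvitz-Thompson

print(naive_sum / reps, ht_sum / reps)  # naive is near 0.68, HT is near 0.30
```

<p>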
That is simply a design choice that is commonly employed as a tool to improve the precision of the estimates under certain circumstances.</p> <p>But what if we do not know the <script type="math/tex">\pi_i</script>s? Equivalently, <strong>what if we do not know the process that yields <script type="math/tex">S</script>?</strong> Well, welcome to most real-world data analysis problems.</p> <h2 id="calibration">Calibration</h2> <p>In practice, we rarely know the structure of the sampling process. However, when we do know it, we saw that the HTE proposes to weight the data from unit <script type="math/tex">i</script> with <script type="math/tex">(N\pi_i)^{-1}</script> and this choice produces a centered estimator. That’s nicely intuitive since we may interpret <script type="math/tex">\pi_i^{-1}</script> as the number of units in <script type="math/tex">U</script> that the unit <script type="math/tex">i</script> will be responsible for representing in case it’s sampled.</p> <p>The baseline approach is usually the sample proportion, which we know to be our best guess of <script type="math/tex">P(C)</script> under SRS. How can we improve it?</p> <p>Let’s suppose that knowing that a unit belongs to <script type="math/tex">A</script> is a relevant piece of information to guess whether or not it is also in <script type="math/tex">C</script>. In other words, <script type="math/tex">A</script> and <script type="math/tex">C</script> are correlated, and so the distribution <script type="math/tex">P(C \mid A)</script> is different from <script type="math/tex">P(C)</script>. We’ll denote this hypothesis by <script type="math/tex">\mathcal{H}</script>.</p> <p>From the <a href="https://en.wikipedia.org/wiki/Law_of_total_probability">law of total probability</a>, we have</p> <script type="math/tex; mode=display">\begin{align} P(C) = P(C \mid A) \ P(A) + P(C \mid B) \ P(B).
\end{align}</script> <p>A similar equation also holds when we additionally condition on the sample:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} P(C \mid S) &= P(C \mid A \cap S)\ P(A \mid S) + P(C \mid B \cap S) \ P(B \mid S). \end{align} %]]></script> <p>It may be useful to visualize these equations in the figure at the beginning. Note that <script type="math/tex">P(C \mid S)</script> is just the baseline estimate, the fraction of <script type="math/tex">C</script> in <script type="math/tex">S</script> once <script type="math/tex">S</script> is realized. Also, it’s informative to evaluate this last equation for the SRS case, just to convince ourselves that it makes sense:</p> <script type="math/tex; mode=display">\widehat{P_{\text{SRS}}(C)} = \frac{n(C \cap A)}{n(A)} \frac{n(A)}{n} + \frac{n(C \cap B)}{n(B)} \frac{n(B)}{n} = \frac{n(C)}{n},</script> <p>which is right! We’ve used the fact that <script type="math/tex">A</script> and <script type="math/tex">B</script> form a partition of <script type="math/tex">U</script>. By dropping the conditioning on <script type="math/tex">S</script>, i.e. before the sample is realized, we again have a random variable.</p> <p>What I want to point out now is how we may improve this estimator induced by <script type="math/tex">P(C \mid S)</script> given that we do not know the structure of the sampling process but <script type="math/tex">\mathcal{H}</script> is true and we have access to auxiliary information on <script type="math/tex">P(A)</script> in the form of an alternative estimate <script type="math/tex">\widehat{P(A)}</script>.</p> <p>The (observed) bias of our baseline estimator is <script type="math/tex">P(C \mid S) - P(C)</script>.
Hopefully, it’s clear that it has two sources:</p> <ol> <li><script type="math/tex">P(A \mid S) \neq P(A)</script>: Under <script type="math/tex">\mathcal{H}</script>, over- or under-representation of <script type="math/tex">A</script> in <script type="math/tex">S</script> can be a problem since we want to estimate <script type="math/tex">P(C)</script>;</li> <li><script type="math/tex">P(C \mid A \cap S) \neq P(C \mid A)</script> (similarly for <script type="math/tex">B</script>): Even when the previous point is OK, we can still suffer if there is selection bias within <script type="math/tex">A</script> and <script type="math/tex">B</script> with respect to <script type="math/tex">C</script>.</li> </ol> <p>It’s a good exercise to verify that both inequalities above turn into equalities under SRS. It’s also important to notice that these can cause problems <em>when we use the baseline fraction-of-<script type="math/tex">C</script>-in-<script type="math/tex">S</script> estimator</em> and these issues are present. In general, these conditions can be met by probability sampling designs and that would not be a problem because we would be aware of it and could employ, say, the HTE. We want to analyze the disparities between a naive guess and possible underlying populational realities.</p> <p>The idea is simply to substitute <script type="math/tex">P(A \mid S)</script> by the auxiliary estimate <script type="math/tex">\widehat{P(A)}</script>, with <script type="math/tex">\widehat{P(B)} = 1 - \widehat{P(A)}</script>, to obtain</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \widehat{P_\text{calib}(C)} &= \widehat{P(A)} \frac{n(C \cap A)}{n(A)} + \widehat{P(B)} \frac{n(C \cap B)}{n(B)} \\ &= \sum_{i \in S} \left( \frac{\widehat{P(A)} \cdot A_i \cdot C_i}{n(A)} + \frac{\widehat{P(B)} \cdot B_i \cdot C_i}{n(B)} \right). \end{align} %]]></script> <p>This technique is known as <strong>calibration</strong>. 
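</p> <p>A numerical sketch of the idea (the numbers are invented, and <script type="math/tex">\widehat{P(A)}</script> is assumed to come from an external source such as a census): suppose <script type="math/tex">A</script> correlates with <script type="math/tex">C</script> and is heavily over-represented in <script type="math/tex">S</script>:</p>

```python
import random

random.seed(7)

# Population where A-membership predicts C-membership:
# P(A) = 0.5, P(C|A) = 0.6, P(C|B) = 0.2, hence P(C) = 0.4.
N = 50_000
A = [1 if i < 25_000 else 0 for i in range(N)]
C = [1 if random.random() < (0.6 if a else 0.2) else 0 for a in A]

# Unknown-to-us biased sampling: units in A respond three times more often.
S = [i for i in range(N) if random.random() < (0.15 if A[i] else 0.05)]

n = len(S)
n_A = sum(A[i] for i in S)
n_B = n - n_A
naive = sum(C[i] for i in S) / n                 # fraction of C in S

# Calibration: replace P(A|S) with the external estimate of P(A).
P_A_hat = 0.5                                    # auxiliary information
p_C_A = sum(C[i] for i in S if A[i]) / n_A       # estimate of P(C|A)
p_C_B = sum(C[i] for i in S if not A[i]) / n_B   # estimate of P(C|B)
calib = P_A_hat * p_C_A + (1 - P_A_hat) * p_C_B

print(naive, calib)  # naive is near 0.5; calibrated is near P(C) = 0.4
```

<p>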
As discussed, it can only be expected to work as well as the sample approximations of the conditional distributions <script type="math/tex">P(C \mid A)</script> and <script type="math/tex">P(C \mid B)</script>.</p> <p>This estimator is also linear in <script type="math/tex">C_i</script> and the HTE coefficient <script type="math/tex">k_i = (N\pi_i)^{-1}</script> is being <em>estimated</em> as <script type="math/tex">\widehat{P(A)}/{n(A)}</script> if <script type="math/tex">i \in A</script>, and <script type="math/tex">\widehat{P(B)}/{n(B)}</script> otherwise. Also, whenever <script type="math/tex">\widehat{P(A)}</script> is a good approximation, i.e. <script type="math/tex">\widehat{P(A)} \approx P(A) = N(A)/N</script>, we get</p> <script type="math/tex; mode=display">\frac{\widehat{P(A)}}{n(A)} = \left( N \cdot \frac{n(A)}{N(A)} \right)^{-1},</script> <p>and thus we are estimating the inclusion probability <script type="math/tex">\pi_i</script> as the probability <script type="math/tex">n(A)/N(A)</script> that a random unit from <script type="math/tex">A</script> will be found to be in <script type="math/tex">S</script> (take a second to think about it).</p> <p>That’s how our assumptions plus this calibration thing are effectively modeling the sampling process.</p> <p>It seems reasonable to argue that if we have no prior information on the <script type="math/tex">\pi_i</script>s, our expectation on the statistical performance of <script type="math/tex">\widehat{P_\text{calib}(C)}</script> is equal to or better than that of <script type="math/tex">\widehat{P_\text{SRS}(C)}</script> whenever <script type="math/tex">\widehat{P(A)}</script> is more accurate than <script type="math/tex">n(A)/n</script>.</p> <h2 id="sum-up">Sum up</h2> <p>We’ve learned that</p> <ul> <li>Under SRS, the commonly used sample proportion is a good guess for the populational proportion</li> <li>When that is not the case but we know the sampling process, the general form of the proportion
estimator is given by the Horvitz-Thompson estimator, in which each unit <script type="math/tex">i</script> is assigned a weight proportional to <script type="math/tex">\pi_i^{-1}</script>;</li> <li>When we do not know the sampling process but we have access to information on auxiliary variables that correlate with the variable we want to understand, we can use this extra info to calibrate our naive sample proportion and expect its performance to improve a bit, depending on how strong that correlation is.</li> </ul> <p>In a future post, I hope to show the technique in action with real data.</p> <p>Thanks for reading.</p>

Intro to hypothesis testing · Abel Borges · 2019-07-10 · abelborges.github.io/hypothesis-testing

<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <p>In this post, we walk through a toy problem to talk about the anatomy and design of a statistical test of hypothesis. We will see</p> <ul> <li>What is a p-value and how to compute it</li> <li>What are False Positive rates and why it’s important to know them well</li> <li>How to fix a conservative bias in the decisions based on the test results</li> <li>Limitations of statistical tests</li> </ul> <h2 id="toy-problem">Toy problem</h2> <p>Would you believe it if someone told you that 1 out of 10 coin tosses came up as heads?
A general approach to <strong>decide</strong> is to answer “what are the odds?!”.</p> <p>The simplest ways to do that include making assumptions about how the world works and then computing the likelihood of the observed reality happening in a world like that.</p> <p>Let’s assume that each coin flip is equally likely to turn out heads or tails, independently of previous coin flips. In other words, the coin is unbiased.</p> <p><strong>If that is true</strong>, then the number of heads in a stream of 10 coin tosses follows a <a href="https://en.wikipedia.org/wiki/Binomial_distribution#Probability_mass_function">binomial distribution</a> with parameters 10 and 1/2 and we would expect that about half (that is, 5) of the coin flips would turn out heads. Below are the probabilities of observing each possible “heads” count:</p> <p><img src="/images/hypothesis-testing/heads-count.png" alt="" /></p> <p>The probability of observing exactly what our friend said (1 head) would be about <strong>0.98%</strong>.</p> <p>More importantly, we can compute the <strong>tail probability</strong> of events <strong>at least as extreme</strong> as the observed count of 1, that is, heads counts of 0, 1, 9 or 10, since they are all at least as far from the expected result 5 as 1 is. That is, each of them is at most as likely an outcome as 1. The result is approximately <strong>2.15%</strong>.</p> <p>You can use that number to make a call. Is 2.15% low enough for us to disbelieve that heads and tails are equally likely? Does it make the phrase <em>“1 out of 10 were heads”</em> sound like nonsense? The decision is up to you.</p> <p>You may say a number of valid things, like:</p> <ul> <li>“Since 2.15% is greater than 1%, I think the coin is indeed unbiased.”</li> <li>“That’s too unlikely! If he said the heads count was between 3 and 7 I would believe it.”</li> <li>“10 samples are not enough.
I need more data to decide.”</li> </ul> <p>In any case, there is a subset of outcomes for which we would reject the idea that the coin is unbiased.</p> <h2 id="naming-stuff">Naming stuff</h2> <p>Well, we have pretty much done it! We have just built a statistical test. It required</p> <ol> <li>Hypothesizing a model of <strong>how the world works</strong>: The outcomes of coin flips are independent and equally likely</li> <li>Evaluating the odds of events at least as extreme as the <strong>observed reality</strong> happening if the model is true: The probability of observing heads once out of 10 coin tosses was evaluated to 0.98% and the probability of observing events like this or even more extreme (with respect to the initial hypothesis) was evaluated to 2.15%.</li> <li>Making a <strong>call</strong> based on the tail probability according to a pre-specified rule.</li> </ol> <p>The assumptions we make about how the world works (“the coin is unbiased”), upon which we compute the probability of interest, compose what is usually called the <strong>null hypothesis</strong>. The tail probability (2.15%) is called the <strong>p-value</strong>.</p> <p>The function of data that we use to compute the p-value (number of heads) is called the <strong>test statistic</strong>. It has a distribution under the null hypothesis (binomial, with parameters 10 and 1/2) sometimes called the <strong>null distribution</strong>. 
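</p> <p>The numbers from the toy problem can be reproduced directly from the null distribution (a standard-library sketch; <code>math.comb</code> requires Python 3.8+):</p>

```python
from math import comb

def binom_pmf(k, n=10, p=0.5):
    """P(exactly k heads in n tosses of a coin with heads probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 1 head in 10 tosses of a fair coin.
p_one = binom_pmf(1)                                # 10/1024, about 0.98%

# p-value: outcomes at least as extreme as 1, i.e. 0, 1, 9 or 10 heads.
p_value = sum(binom_pmf(k) for k in (0, 1, 9, 10))  # 22/1024, about 2.15%

print(round(100 * p_one, 2), round(100 * p_value, 2))  # 0.98 2.15
```

<p>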
The set of outcomes that would make us doubt the null hypothesis in such a way that we reject it as a plausible description of reality (for instance, “the p-value is smaller than 1%” or “the distance of the heads count to 5 is greater than 2”) is called the <strong>critical region</strong>.</p> <p>When making a decision informed by the p-value, we may be wrong in two ways:</p> <ol> <li>If we decide the coin is biased, when actually it’s not; and</li> <li>If we decide the coin is unbiased, when in fact it is biased.</li> </ol> <p>The first one is called a <strong>False Positive (FP)</strong>. The second one is a <strong>False Negative (FN)</strong>. The null hypothesis is usually set up as the hypothetical world where</p> <ul> <li>“everything is OK”, or</li> <li>“nothing has changed”, or</li> <li>“you are healthy”, or</li> <li>“he is innocent”, or</li> <li>“there’re no differences between these groups of people”, or</li> <li>“there’s no enemy plane entering our borders”,</li> </ul> <p>and so on. Therefore, the term “Positive” in “False Positive” refers to the detection of “something”.</p> <p>Controlling the rates of FPs and FNs is a central task in the design of a statistical test. Those rates pose a trade-off to be solved by the designer:</p> <ul> <li><strong>I don’t want to miss anything!</strong> Being too afraid of failing to catch any deviations from the null hypothesis leads to setting very small FN rates. That may in turn cause an increase in FPs since we may shout out “I saw something!!!” for almost anything.</li> <li><strong>I hate interruptions!</strong> In the context of quality control of industrial products, we may want to avoid stopping the machines unless something really critical is detected, and then very small FP rates will be preferred.
Accordingly, that may imply increased FN rates since we are going to say “everything is just fine” and keep moving more often than we should.</li> </ul> <h2 id="how-sure-are-you-about-that-uncertainty-estimate">How sure are you about that uncertainty estimate?</h2> <p>Hypothesis testing is all about decision making. Statistical tests are probabilistic decision-informing machines. Different situations induce different levels of rigor on the required accuracy of p-values.</p> <p>The greater the risk around the decision to be made, the greater the consequences of mishandling statistics.</p> <p>An important property of a test is its <a href="https://en.wikipedia.org/wiki/Power_(statistics)"><strong>power</strong></a>. The power of a statistical test is the probability of rejecting the null hypothesis given that some alternative hypothesis is actually the truth.</p> <p>Let’s define the critical region for our coin-flipping problem as</p> <blockquote> <p>The set of outcomes for which the p-value is less than 5%.</p> </blockquote> <p>That means we have fixed the FP rate to 5%: we expect to be wrong only 5% of the time when rejecting (the “I see something!!!” part) the null hypothesis.</p> <p>Once we fix the FP rate, we can simulate coin tosses and estimate the probability of rejecting the null hypothesis by using our test for various combinations of the number of coin tosses and the hypothetical “heads” probability.</p> <p><img src="/images/hypothesis-testing/power-function.png" alt="" /></p> <p>We can draw some conclusions from the <strong>power function</strong> above:</p> <ul> <li>The power grows as the true p gets more and more distant from the p induced by the null hypothesis (50%)</li> <li>Detecting deviations from the null hypothesis also becomes easier with more data from the same process.
For instance, if the true p is between 40% and 60%, observing 100 instead of 10 coin flips lifts our rate of correct detections from 5% to 50%.</li> <li>The threshold distance from 50% for which the power reaches 100% is a function of the number of coin tosses. From the image, we see that this threshold is approximately 35% and 25% for 50 and 100 coin flips, respectively.</li> <li>The black line marks our pre-set, desired 5% FP rate. We see that our test procedure is <em>systematically conservative</em> for small samples in the sense that the actual FP rate is always <em>below</em> the configured 5% rate; it gets closer to 5% as the number of coin tosses grows.</li> </ul> <h3 id="what-is-the-reason-for-the-conservative-bias">What is the reason for the conservative bias?</h3> <p>The probability of events as extreme as 2 heads in 10 coin tosses (that is, observing 0, 1, 2, 8, 9 or 10 heads) is 10.94%. We saw that this number is 2.15% for events as extreme as 1. The 5% FP rate we have pre-set is in-between these two numbers. The effect of the rule</p> <blockquote> <p>Reject when p-value &lt; 5%</p> </blockquote> <p>under a <a href="https://en.wikipedia.org/wiki/List_of_probability_distributions#Discrete_distributions">discrete distribution</a> is that our actual FP rate can be much smaller than 5%. That means less-than-specified, conservative FP rates.</p> <p>In our case, the sample size impacts the gap because it is an explicit parameter of the binomial distribution. The gap becomes negligible as the number of coin tosses grows, as we can see below.
The gap remains consistently below 5% once the sample size is bigger than 37; the corresponding threshold is a sample size of 624 if we want to guarantee a gap of less than 1%.</p> <p><img src="/images/hypothesis-testing/tail-prob-gap.png" alt="" /></p> <h3 id="can-we-do-something-about-it">Can we do something about it?</h3> <p>We saw that by bounding the gap on tail probabilities of a binomial distribution we bound the conservative bias of our test. But, until now, for our case, the only solution is to increase the sample size. What if we cannot gather more data? We want a test that always obeys the pre-set FP rate.</p> <p>We can think about the FP rate as the <a href="https://en.wikipedia.org/wiki/Expected_value#Finite_case">expected value</a> of a <strong>decision function</strong>. We’ve been using a function like that without realizing it because it’s trivial:</p> <ul> <li>Decision is 1 when p-value is less than 5%;</li> <li>Otherwise, decision is 0.</li> </ul> <p>The (actual) FP rate is the expected value of this function:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} 1 \cdot P(\text{p_value} < 5\%) + 0 \cdot P(\text{p_value} \geq 5\%) = P(\text{p_value} < 5\%). \end{equation} %]]></script> <p>Now, it’s easier to see the problem: <script type="math/tex">% <![CDATA[ P(\text{p_value} < 5\%) %]]></script> doesn’t necessarily yield 5% for discrete distributions as it was intended because of the gap problem we’ve discussed. In order to fix that, the idea is to randomize the decision function (w.p. is short for “with probability”):</p> <ul> <li>If p-value &lt; 5%, decision is 1 w.p. <script type="math/tex">\gamma</script> and 0 w.p. <script type="math/tex">1 - \gamma</script>;</li> <li>Otherwise, decision is 1 w.p. <script type="math/tex">1 - \gamma</script> and 0 w.p.
<script type="math/tex">\gamma</script>.</li> </ul> <p>The constant <script type="math/tex">\gamma</script> is just a number chosen so that the actual FP rate, the expectation</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \gamma \cdot P(\text{p_value} < 5\%) + (1 - \gamma) \cdot \left[1 - P(\text{p_value} < 5\%)\right], \end{equation} %]]></script> <p>equals the pre-specified FP rate, say <script type="math/tex">\alpha</script>. Solving it for <script type="math/tex">\gamma</script>:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \gamma = \frac{1 - \alpha - P(\text{p_value} < 5\%)}{1 - 2 P(\text{p_value} < 5\%)}. \end{equation} %]]></script> <p>In our case, for 10 coin tosses, the probability of hitting the threshold rule “p-value &lt; 5%” is 2.15%, and it corresponds to observing 0, 1, 9 or 10 heads. Plugging that into the formula above, we obtain <script type="math/tex">\gamma = 97.02\%</script>. Therefore, when we think we “saw something” (p-value &lt; 5%), we only decide to reject the null hypothesis 97.02% of the time. We “flip a coin” with probability <script type="math/tex">\gamma</script> to decide. Randomizing a decision may seem odd, but our expectation is that acting consistently according to this new rule, we will have better control over the FP rate.</p> <p>That is in fact what happens in the long run. Below are the power functions of both tests zoomed in the region around p = 1/2.
We see that the actual FP rate now matches the pre-set FP rate in the randomized test, as intended, even for small sample sizes.</p> <p><img src="/images/hypothesis-testing/power-functions.png" alt="" /></p> <h2 id="do-you-trust-it-for-your-life">Do you trust it for your life?</h2> <h3 id="skin-in-the-game-type-of-example">Skin-in-the-game type of example</h3> <p>Real examples in more important situations are useful for realizing the relationship of the p-value to risk, and the importance of also having reliable estimates of power, besides accurate p-values.</p> <p>Let’s say you want to have surgery to change your appearance. Not having the surgery does not kill you. It is just a “plus”. However, the doctor says that 1% of people die on the operating table. Would you do it?</p> <p>That answer can vary a lot amongst different people. The question of whether that 1% represents a reliable estimate of the probability of death for you in particular may be seen as too skeptical in trivial situations but probably we all agree that it’s reasonable here.</p> <p>In common practical scenarios (e.g. health, finance, or business data analysis), we miss important properties which invalidate most naive, textbook analyses.</p> <h3 id="what-is-it-for">What is it for?</h3> <p>In our coin tossing example, we could come up with different models to make sense of the frequency of heads motivated by valid questions such as</p> <ul> <li>What if coin tosses are not independent of each other?</li> <li>What if our friend doesn’t tell the truth about the observed number of tails?</li> </ul> <p>Answering that may look silly for coin tossing but similar skepticism is an important fuel for more serious research endeavors. That is a topic worth a post of its own.</p> <p>Statistical machinery to model reality and then evaluate p-values accordingly can become as complicated as the problem asks for (or as skeptical as you are).
The rigor is proportional to how critical it is to get it right, as with everything in life.</p> <p>The important thing to have in mind is that</p> <blockquote> <p>we don’t discover truths using these procedures, we identify lies.</p> </blockquote> <p>If we have no evidence to reject the null hypothesis, we can’t be sure that it’s the ultimate truth, since that result can be explained by a multitude of hidden factors neither contained nor explained by the null hypothesis. But very small p-values do indicate that the (strong) claim made by the null hypothesis (“the world is like this”) is <em>very unlikely</em> to be supported by the data. <a href="https://en.wikipedia.org/wiki/Philosophy_of_science#Philosophy_of_statistics">Proving right is hard; detecting lies is easier</a>.</p> <h2 id="sum-up">Sum up</h2> <ul> <li>The basic anatomy of a statistical test: hypothesis (model how the world may work), p-value evaluation (is it plausible to observe this in a world like that?), decide (or iterate);</li> <li>It’s impossible to get it right every time: the FPs versus FNs trade-off;</li> <li>Power estimates are as important as p-values in a way similar to <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence intervals</a> being as important as <a href="https://en.wikipedia.org/wiki/Point_estimation">point estimates</a>;</li> <li>Different problems require different levels of rigor;</li> <li>Cultivating healthy skepticism may save your life.</li> </ul> <p>Thanks for reading!</p>