Calculating Confidence Interval for a Percentile

Calculating the confidence interval for a percentile is a crucial step in understanding the variability and the uncertainty around the estimated value. In many real-world applications, the distribution of the data is unknown and this makes it difficult to determine the confidence intervals. In such scenarios, using a binomial distribution can be a viable alternative to estimate the confidence intervals for a percentile.

For instance, let’s consider a variable with 300 data points and we want to calculate the 70th and 90th percentiles and the corresponding confidence intervals for the variable. To do this, we can use a binomial distribution approach.

First, we need to choose an alpha level, which is a probability that determines the size of the confidence interval. A common choice for alpha is 0.05, which corresponds to a 95% confidence interval.

Next, we use the cumulative distribution function (CDF) of the binomial distribution to estimate the lower and upper bounds of the confidence interval. The CDF of the binomial distribution gives the probability of getting k or fewer successes in n independent Bernoulli trials, where the probability of success in each trial is p.

To calculate the 70th percentile and its confidence interval, we use the following steps:

  1. Set n = 300, which is the number of data points.
  2. Set p = 0.7, which corresponds to the 70th percentile.
  3. Calculate the binomial quantile using the CDF, which is the smallest k such that P(X <= k) >= p, where X is a binomial random variable with parameters n and p.
  4. Use the CDF to determine the lower and upper bounds of the confidence interval.

Below is the python code for calculating the confidence interval for the 70th percentile.

alpha – alpha is a parameter representing the significance level or confidence level for the calculation of the confidence interval. It is the probability that the confidence interval contains the true value of the parameter being estimated. The value of alpha is typically set to 0.05 or 0.01, meaning that there is a 95% or 99% chance, respectively, that the confidence interval contains the true value. In the code, alpha=0.05 is the default value for alpha, but it can be changed to a different value if desired.

n – number of observations

q – percentile value

from scipy.stats import binom
import numpy as np

alpha = 0.05
n = 300
q = 0.7

Below is the code for calculating the upper and lower bounds for the confidence interval. The u value is calculated as the ceiling of the binomial distribution’s quantile function (ppf) evaluated at 1 – alpha / 2 (1 – 0.05 / 2 = 0.975), and the value is shifted by adding an array of numbers from -2 to 2. Any values of u that are greater than n are set to infinity.

u = np.ceil(binom.ppf(1 - alpha / 2, n, q)) + np.arange(-2, 3)
u[u > n] = np.inf

l = np.ceil(binom.ppf(alpha / 2, n, q)) + np.arange(-2, 3)
l[l < 0] = -np.inf

# From the calculation of bounds, np.ceil(binom.ppf(1 - alpha / 2, n, q)) and np.ceil(binom.ppf(alpha / 2, n, q)), we obtain that
# the upper bound value is 225 and the lower bound value is 194. This means that given a sample of size 300, a binomial distribution, and # probability of success p=0.7, we are 95% certain that the number of successes will be between 194 and 225.

Next we calculate coverage of the percentiles that the bounds cover. The coverage represents a matrix of values that correspond to the probability of coverage of the confidence interval for each combination of lower and upper bounds of the interval.

The coverage calculation uses the binom.cdf function to calculate the cumulative distribution function (CDF) for the binomial distribution, which is then used to determine the coverage probability of each combination of u and l. Once the coverage matrix is calculated, the code finds the index i corresponding to the combination of u and l that gives the closest coverage probability to 1-alpha.

coverage = np.zeros((len(l), len(u)))

for i, a in enumerate(l):
    for j, b in enumerate(u):
        coverage[i, j] = binom.cdf(b - 1, n, q) - binom.cdf(a - 1, n, q)

Next we select the upper and lower bounds of the confidence interval based on the coverage of the interval. The code first checks if the maximum coverage is less than 1 minus the significance level alpha. If it is, the code selects the pair of bounds with the maximum coverage probability. Otherwise, the code selects the pair of bounds with the smallest coverage probability that is still greater than or equal to 1 minus alpha.

if np.max(coverage) < 1 - alpha:
    i = np.where(coverage == np.max(coverage))
else:
    i = np.where(coverage == np.min(coverage[coverage >= 1 - alpha]))

i_u = i[0][0]
i_l = i[1][0]

u_final = min(n, u[i_u])
u_final = max(0, int(u_final)-1)
        
l_final = min(n, l[i_l])
l_final = max(0, int(l_final)-1)

The resulting l and u are 192 and 223, respectively. Therefore if you have a sample of 300 and you want to calculate the confidence interval for a variable X, you would sort the values in ascending order, and then you would take the values of X that correspond to the 192nd and 223rd observations.

Advent of Code Day 5 – my bonus question

I am doing the Advent of Code. So far I have solved all the questions for the four previous days and part one of the question for day five. I have also created my own question for fun, the question is below:

After many hours of walking, the Elves come to a forest glade. They are quite tired and hungry, one of the elves suddenly notices that the glade is full of mushrooms. The Elves are familiar with this mushrooms species – they are edible and quite tasty. The Elves pick all of the mushrooms and are almost ready to make mushroom soup, when they remember about one tricky problem – there is a poisonous mushroom species that looks very similar and often a poisonous mushroom will grow right among the edible mushrooms.

At this point the elves have determined the molecular structure of each mushrooms that they picked. The structure always consists of five segments and each segment consists of a number and a letter.

Example: 0.9H 0.08G 0.27L 0.57M 0.84P

Each letter molecule (A – Z) has a corresponding weight, from 0 to 25. The numbers also represent additional weight units. It is therefore possible to calculate the molecular weight of each mushroom. In the above example the weight would be 0.9 + 7 + 0.08 + 6 + 0.27 + 11 + 0.57 + 12 + 0.84 + 15 = 53.66

If the structure had a negative number, such as if it would be 0.9H -0.08G 0.27L 0.57M 0.84P, then the negative segment would need to be subtracted. The weight then would be 0.9 + 7 – 0.08 – 6 + 0.27 + 11 + 0.57 + 12 + 0.84 + 15 = 41.5

The Elves are aware that the value of each segment of a mushroom comes from a process generated by ~N(12.5, 4.5) and there is no correlation between the segments. (The value of the segment is number + letter, for example 0.9H is 7.9, while -0.08G is -6.08).

The mushroom that is poisonous is definitely tricky to find for the Elves because it looks exactly the same as the edible mushrooms. BUT! The molecular structure of this mushroom gives it away! It is very unlikely that such structure would be generated by the same process as for the edible mushrooms. Find the poisonous mushroom from the input list so that the Elves can start cooking their soup.

The list of mushrooms is in the link below:

Advent of Code Day 5 bonus question input